---
format:
    html:
        embed-resources: true
---

# Crawling 

## Overview 

In this portion of the homework, you will crawl Google Jobs to collect job descriptions for later processing.

We will use the `serpapi` API to crawl Google Jobs. `serpapi` is a paid API, but its free tier should be more than enough for this homework. The API lets you search Google programmatically, which has a wealth of practical applications.

You will need an API key from SerpApi. Signing up is free, and you should not enter any payment information.

https://serpapi.com/manage-api-key

This will get you 100 free searches per month. Each search returns about 10 job descriptions, for a total of around 1000 possible job descriptions per month.

This portion of the homework relies on limited API search resources, so prototype carefully. Once you are sure your code is working 100%, run it one last time.

Consider reserving about 10 searches for prototyping, with the remaining 90 searches for your final "production" run.
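
The budget described above can be checked with a few lines of arithmetic. This is only a sketch; the variable names are illustrative, and the results-per-search figure is approximate.

```python
FREE_SEARCHES_PER_MONTH = 100   # free-tier allowance stated above
RESULTS_PER_SEARCH = 10         # approximate; some searches return fewer
PROTOTYPE_SEARCHES = 10         # suggested reserve for debugging

# Searches left over for the final "production" run
production_searches = FREE_SEARCHES_PER_MONTH - PROTOTYPE_SEARCHES

# Upper bound on job descriptions collected in the production run
max_jobs = production_searches * RESULTS_PER_SEARCH

print(production_searches)  # 90
print(max_jobs)             # 900
```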

Make sure to save the outputs and back them up after this final run, in case you delete them by mistake.

We will use the following Python wrapper for the API. It can be installed with

`pip install google-search-results`

The following are additional useful reference resources.

For instructions on the API, see:

* [https://serpapi.com/google-jobs-api](https://serpapi.com/google-jobs-api)
* [https://serpapi.com/blog/scrape-google-jobs-organic-results-with-python/](https://serpapi.com/blog/scrape-google-jobs-organic-results-with-python/)
* [https://serpapi.com/integrations/python](https://serpapi.com/integrations/python)

## Starter code 

Here is some starter code:

`Note: uule parameter`

The uule parameter is an encoded location parameter used in Google search queries. It stands for "Unique User Location Encoding" and is used to specify the geographic location from which the search is being conducted. This can influence the search results to be more relevant to the specified location.

This can be set to `'w+CAIQICINVW5pdGVkIFN0YXRlcw'`, which is an encoded string representing a specific location, i.e. the United States. This encoding helps simulate searches as if they are being conducted from that location, which can be useful for testing or gathering location-specific data.
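
One commonly described, unofficial reading of this string is that a fixed prefix is followed by a single length character and then the base64-encoded location name. Google does not document the format, so treat the sketch below as an assumption; it only decodes the example value given above, and the variable names are illustrative.

```python
import base64
import string

uule = "w+CAIQICINVW5pdGVkIFN0YXRlcw"

# Assumed layout: fixed prefix + one length character + base64(location name)
prefix = "w+CAIQICI"
tail = uule[len(prefix):]
length_char, encoded_name = tail[0], tail[1:]

# The value is stored without base64 padding, so restore it before decoding
padded = encoded_name + "=" * (-len(encoded_name) % 4)
location_name = base64.b64decode(padded).decode("utf-8")
print(location_name)  # United States

# Under the assumed layout, the length character indexes into this alphabet
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
print(alphabet.index(length_char), len(location_name))  # 13 13
```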

In [1]:
from serpapi import GoogleSearch
import json

Save your API key in a centralized location, e.g. `~/.api-keys.json`.

Read it in with the `json` module.

In [2]:
import json
with open('/Users/james/.api-keys.json') as f:
    keys = json.load(f)
API_KEY = keys['serpapi']
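
The cell above hard-codes one user's home directory. A minimal alternative sketch, assuming the key file lives at `~/.api-keys.json` and stores the key under `"serpapi"` as above, resolves the home directory at runtime. The helper name `load_api_key` is hypothetical.

```python
import json
from pathlib import Path


def load_api_key(name="serpapi", path=Path.home() / ".api-keys.json"):
    """Return the key stored under `name` in a JSON file of API keys."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"No key file at {path}")
    with path.open() as f:
        keys = json.load(f)
    if name not in keys:
        raise KeyError(f"No entry named {name!r} in {path}")
    return keys[name]
```

With this helper, `API_KEY = load_api_key()` works on any machine where the key file is in the home directory, and the key itself never appears in the notebook.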

Be careful: don't run this too many times while debugging and prototyping, or you will use up all your free searches.

In [3]:
search_query = 'data science'
params = {
    'api_key': API_KEY,                      # https://serpapi.com/manage-api-key
    'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',  # encoded location (USA)
    'q': search_query,                       # search query
    'hl': 'en',                              # language of the search
    'gl': 'us',                              # country of the search
    'engine': 'google_jobs',                 # SerpApi search engine
}

Let's do one search and explore the output.

In [4]:
search = GoogleSearch(params)   			# where data extraction happens on the SerpApi backend
result_dict = search.get_dict() 			# JSON -> Python dict

if 'error' in result_dict:
    print("ERROR FOUND IN SEARCH")
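
Printing a fixed message hides the reason a search failed, and the next cell would then fail on a missing `jobs_results` key. A small sketch, assuming (as the check above does) that a failed search carries an `error` key, surfaces the message instead. The helper name `get_jobs` is hypothetical.

```python
def get_jobs(result_dict):
    """Return the list of job postings, or raise if the response reports an error."""
    if "error" in result_dict:
        raise RuntimeError(f"Search failed: {result_dict['error']}")
    # Fall back to an empty list if the response has no job results
    return result_dict.get("jobs_results", [])
```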

In [7]:
for result in result_dict['jobs_results']:
    print(result)
    # google_jobs_results.append(result)

{'title': 'Intern Data Science AI&I - PHD (On-site)', 'company_name': 'Mayo Clinic', 'location': 'Rochester, MN', 'via': 'Mayo Clinic Careers', 'share_link': 'https://www.google.com/search?ibp=htl;jobs&q=data+science&htidocid=yR0hQNEpSKswbeZYAAAAAA%3D%3D&hl=en-US&shndl=-1&source=sh/x/job/li/m1/1#fpstate=tldetail&htivrt=jobs&htiq=data+science&htidocid=yR0hQNEpSKswbeZYAAAAAA%3D%3D', 'thumbnail': 'https://serpapi.com/searches/671544b8316650605d45bc2c/images/38d82cd27cffaaa7d885388f748141ba8257bfd16e379fa3e64fc0cf0bc18251.png', 'extensions': ['4 days ago', 'Full-time and Internship'], 'detected_extensions': {'posted_at': '4 days ago', 'schedule_type': 'Full-time and Internship'}, 'description': 'Why Mayo Clinic\n\nMayo Clinic is top-ranked in more specialties than any other care provider according to U.S. News & World Report. As we work together to put the needs of the patient first, we are also dedicated to our employees, investing in competitive compensation and comprehensive benefit pla

In web crawling, pagination involves retrieving data across multiple pages of a website. It helps manage large datasets by fetching a limited number of results per request, enabling efficient data extraction without overwhelming the server or exceeding resource limits.

You can use pagination to get more results for a given search, but each request must start where the previous one left off.

You do this by adding the `next_page_token` to the `params` dictionary.

You get the latest `next_page_token` from `result_dict["serpapi_pagination"]`.

If you search 10 times without pagination, you will just get the same first 10 results over and over again.
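
The token hand-off can be sketched without spending any searches. In the example below, `fake_search` is a hypothetical stand-in for `GoogleSearch(params).get_dict()` that returns canned pages, and `collect_pages` is an illustrative helper, not part of the SerpApi wrapper.

```python
def fake_search(params):
    """Hypothetical stand-in for a real search; makes no API call."""
    pages = {
        None: {"jobs_results": ["job-1", "job-2"],
               "serpapi_pagination": {"next_page_token": "tok-A"}},
        "tok-A": {"jobs_results": ["job-3", "job-4"],
                  "serpapi_pagination": {"next_page_token": "tok-B"}},
        "tok-B": {"jobs_results": ["job-5"]},  # last page: no token returned
    }
    return pages[params.get("next_page_token")]


def collect_pages(base_params, max_pages, search=fake_search):
    """Fetch up to max_pages pages, passing each page's token to the next request."""
    jobs, token = [], None
    for _ in range(max_pages):
        params = dict(base_params)
        if token is not None:
            params["next_page_token"] = token  # continue where the last page ended
        result = search(params)
        jobs.extend(result.get("jobs_results", []))
        token = result.get("serpapi_pagination", {}).get("next_page_token")
        if token is None:  # nothing left to fetch
            break
    return jobs


print(collect_pages({"q": "data science", "engine": "google_jobs"}, max_pages=3))
# ['job-1', 'job-2', 'job-3', 'job-4', 'job-5']
```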

In [11]:
print(result_dict["serpapi_pagination"],"\n")
print(result_dict["serpapi_pagination"]["next_page_token"])

{'next_page_token': 'eyJmYyI6IkVxSUVDdUlEUVVwSE9VcHJUakZETlhGaGFHbGlaRFJ3V0ZSMVVYWnZOR3hrUmtWTFowZE5XV1kxYkdaemQyMUpka1ZZUmpBeE1YaFZTV05PU20wNU1YbHRTbFJ1VFU5blVuWmZUV3MwZUMxb1drZE5SRFV0VjJ4dlIzZEtRMGhaTlZjeVVVWlBNa2RKWTNSUGVFNHlNV0l0VDI1VmVITXRaUzFhTUVwdGNYTk9PWGRUUzJ4VU5WRlVNelpEYkRoSk1VaHVZMUZSV2xkbVYwTmlWVkJ1YlY5MFlYbFNaMmxhY0hVNU1HTmlZMlJhWlc5MVVDMXZjRFZhTW05bVRGZHNWa1F0ZW1KSlgwaHROWEZYYVhKemMzZEVSVE5FUzBGM2MyUjFOakEzZEY5RFJtWktjbTVrVmxKTVRrZG5WelJLUzFGSE5IaHdWbWhqVW0xWk1VTnlRM1YxTFRScU9ERkhjMnBITUd4TWVXWjBkRTVOYTJKZlJIUmlOVFI0VWpFelRFSm5abFozUWxSMU5FaExUbU5vY1ZaUU5XczVVbVU0WTBwWVlWUm1iRU5DZEhwU2F6ZENiRmN4TlRkZlpXaEJOMGhoYURKQk1rMW5ZV3RhZW5SUFFteFRkV2hpYlhFNUxVazVaVEJ0TWxwSWVFTnJZME5zTWxWVFVXSjFVbUZMVnpOc1ptcHJkV1pTZUcxdldqRjJSVGh4YkVwT09FUktNVWR3ZG5GSVgxcFBUMVZ4U0ZZME9IaFZZa3ROTm1kYU1rNHhWMVYwTUcxQk1HaGpRMXBRWTBFeE5VaGFZV04wYVdsMGNXZDRlR295YldKalNWSnVUemxuVmxObk1XRnNUazVzVDB4c2NtY1NGM1ZWVVZaYU9XcDBSV0ZRU1hCMFVWQXljMlozWjFGWkdpSkJSbGh5UldOeE1YcGlVakphVlZKRk16SkJaRE5aYlRkdk9FTTJVR1J

Let's do one more search, but this time with pagination, starting where the last search left off.

In [12]:
search_query = 'data science'
params = {
    'api_key': API_KEY,                      # https://serpapi.com/manage-api-key
    'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',  # encoded location (USA)
    'q': search_query,                       # search query
    'hl': 'en',                              # language of the search
    'gl': 'us',                              # country of the search
    'num': 10,                               # number of results per page
    'engine': 'google_jobs',                 # SerpApi search engine
    'next_page_token': result_dict["serpapi_pagination"]["next_page_token"],  # resume after the previous page
}

In [13]:
search = GoogleSearch(params)   			# where data extraction happens on the SerpApi backend
result_dict = search.get_dict() 			# JSON -> Python dict

if 'error' in result_dict:
    print("ERROR FOUND IN SEARCH")

In [14]:
for result in result_dict['jobs_results']:
    print(result)
    # google_jobs_results.append(result)

{'title': 'Principal Data Scientist', 'company_name': 'Hp', 'location': 'Colorado, TX', 'via': 'ZipRecruiter', 'share_link': 'https://www.google.com/search?ibp=htl;jobs&q=data+science&htidocid=Q8aAld1zBtNh7HBHAAAAAA%3D%3D&hl=en-US&shndl=-1&source=sh/x/job/li/m1/1#fpstate=tldetail&htivrt=jobs&htiq=data+science&htidocid=Q8aAld1zBtNh7HBHAAAAAA%3D%3D', 'thumbnail': 'https://serpapi.com/searches/67154854400bd50bf3e5c040/images/72889b0014d22365ed2a0cd83cdc7b94235cb5df4ed96b34b8259ad67bd9691f.gif', 'extensions': ['2 days ago', 'Full-time', 'Dental insurance', 'Health insurance', 'Paid time off'], 'detected_extensions': {'posted_at': '2 days ago', 'schedule_type': 'Full-time', 'dental_coverage': True, 'health_insurance': True, 'paid_time_off': True}, 'description': "Principal Data Scientist\n\nDescription -\n\nHP's Digital and Transformation Organization (D&TO) is focused on building world-class digital capabilities, and our Data Science team is responsible for leading the development the data

## Utility function

Create a utility function to search Google Jobs and save the results to a file.

Here is one sketch of what the function might look like:

- Imports the current date and time using `datetime`.
- Defines `search_google_jobs` to perform a Google Jobs search with a default or custom query.
- Accepts parameters for the search query, pagination token, and verbosity.
- Sets search parameters like API key, location, language, and search engine.
- Appends the pagination token if provided.
- Creates a timestamped output filename based on the query and time.
- Does a search and data extraction.
- Optionally prints the data if `verbose` is `True` and saves results to a JSON file.
- Returns the `next_page_token` for pagination or handles errors.
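
The outline above could be sketched roughly as follows. This is one possible shape, not a reference solution: the `search` and `out_dir` parameters, the `serpapi_search` helper, the `"YOUR_API_KEY"` placeholder, and the simple filename cleaning are all assumptions added so the function can be tried with a stub before spending real searches.

```python
import json
from datetime import datetime
from pathlib import Path


def serpapi_search(params):
    """Run one real search; each call uses one search from your quota."""
    from serpapi import GoogleSearch  # imported lazily so the sketch loads without the package
    return GoogleSearch(params).get_dict()


def search_google_jobs(search_query="data science", next_page_token=None,
                       verbose=False, api_key="YOUR_API_KEY",
                       out_dir="data", search=serpapi_search):
    """Search once, save the raw result as JSON, and return the next page token."""
    params = {
        "api_key": api_key,
        "uule": "w+CAIQICINVW5pdGVkIFN0YXRlcw",  # encoded location (USA)
        "q": search_query,
        "hl": "en",
        "gl": "us",
        "engine": "google_jobs",
    }
    if next_page_token is not None:
        params["next_page_token"] = next_page_token

    # Timestamped filename, e.g. data/machine-learning-engineer-2024-10-20-15-04-53.json
    stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    name = search_query.lower().replace(" ", "-")
    out_path = Path(out_dir) / f"{name}-{stamp}.json"

    result = search(params)
    if "error" in result:
        print("ERROR FOUND IN SEARCH:", result["error"])
        return None

    if verbose:
        print(out_path)
        print(json.dumps(result, indent=4)[:500])  # preview only

    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        json.dump(result, f, indent=4)

    # None when there is no further page
    return result.get("serpapi_pagination", {}).get("next_page_token")
```

Passing a stub such as `search=lambda params: {"jobs_results": []}` exercises the file handling without touching the API. The sketch deliberately does not print `params`, since that would expose the API key in the notebook output.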

In [58]:
# INSERT CODE HERE

In [40]:
next_page_token = search_google_jobs(search_query="machine learning engineer", verbose=True)

{
    "api_key": "7e36aef3febfe33db57c49fda7c0cb7e55b0c40b8879104b7389344b8fa8c235",
    "uule": "w+CAIQICINVW5pdGVkIFN0YXRlcw",
    "q": "machine learning engineer",
    "hl": "en",
    "gl": "us",
    "engine": "google_jobs",
    "output": "json",
    "source": "python"
}
data/machine-learning-engineer-2024-10-20-15-04-53.json
{
    "search_metadata": {
        "id": "67155457f6aa536356d23238",
        "status": "Success",
        "json_endpoint": "https://serpapi.com/searches/25610ff1cab56768/67155457f6aa536356d23238.json",
        "created_at": "2024-10-20 19:04:55 UTC",
        "processed_at": "2024-10-20 19:04:55 UTC",
        "google_jobs_url": "https://www.google.com/search?q=machine+learning+engineer&uule=w+CAIQICINVW5pdGVkIFN0YXRlcw&hl=en&gl=us&udm=8",
        "raw_html_file": "https://serpapi.com/searches/25610ff1cab56768/67155457f6aa536356d23238.html",
        "total_time_taken": 1.59
    },
    "search_parameters": {
        "q": "machine learning engineer",
        "engin

In [46]:
next_page_token

'eyJmYyI6IkVyWUVDdmNEUVVwSE9VcHJUazVIY21seVFreEthM3B1ZG01UlRGYzJNblZFV0daUGNHRk1TM05DZHpKVGR6SlNZWFZKVm1vek4wdzRhRUp0WlhCd05sVm1OVFJPVkdOSWVubDVSMVJGTFdaUmVXc3hkV2RsUkhoM1NsTTVNbk5PZVZoWlpIQk1ZbmcwWHpCblRETllSVGhFVGs5d0xVeDZSWGxEZFVKR2QxOVBPR1UxY0RCUU9Fd3hhRVpOTVhWYU5IRllaRVV0YlZsWlpqbFZlSFUxZG05dlgxSndZWGhzYldkQ0xUVmhhV0pKT0U1ZmNEUlVOMjFhUzE5aVpVRmFVR1J5VGpjNVZqaE9iMU41ZUhaWlVUQTJUMlpWTlhaNFRHTmhaSEIwYmtkUWMyVlliV0kzYUVkNFJtWm5NR1V3UlMxMUxUQnFYekpyV0hneGMyOXJMVEJDU1ZwQ2VHWlFjRFZKYzFCbldEQkNRVzlSUVZSYVNtTm5jWGN0Ym5wcFJXTjRiazlLT1VGc2EyczRkVzl6VTBseVZ6bHFhMk5HY0RWcmMwVnpXa05DZUVnMGMwdHVaSGw1VUVNMFNXNUdNVlYwVG5wbE1qQjRVelpFU0ZZNWVuVjFjazUxWVcwMVFuUk9NSEJzTUdwS1JVdGZNUzFuVUZOamRIRk5iSEJaTFVSU1RYVnZaMWxGWlZCVGRsZDRaMmhGY214YVUwRlBVRmRaUkhOSVUwOUJlSFphVm1STWFtSlFjRTFZU0RWWlRsZE9SRWRaVWtSSWRrSlNXbXRMTjJaTmNHTXRNVTlzTmtKcFpHbG5ZbVJ4YTI5RVNrSkJhRXh3UTBKWlRtcHNTREpWWVVoYVZubzBRazFEVkV4TFNsZHRUMVZWZEZaRlZtaHhXRTBTRmxkR1VWWmFMVko1ZFU1MWJURkJYMk15Y1Zob1ExRWFJa0ZHV0hKRlkyOTBUMkZvUzNwaVEzcGpjMkZ4V0V

In [47]:
search_google_jobs(search_query="machine learning engineer", next_page_token=next_page_token, verbose=True)

{
    "api_key": "7e36aef3febfe33db57c49fda7c0cb7e55b0c40b8879104b7389344b8fa8c235",
    "uule": "w+CAIQICINVW5pdGVkIFN0YXRlcw",
    "q": "machine learning engineer",
    "hl": "en",
    "gl": "us",
    "engine": "google_jobs",
    "next_page_token": "eyJmYyI6IkVyWUVDdmNEUVVwSE9VcHJUazVIY21seVFreEthM3B1ZG01UlRGYzJNblZFV0daUGNHRk1TM05DZHpKVGR6SlNZWFZKVm1vek4wdzRhRUp0WlhCd05sVm1OVFJPVkdOSWVubDVSMVJGTFdaUmVXc3hkV2RsUkhoM1NsTTVNbk5PZVZoWlpIQk1ZbmcwWHpCblRETllSVGhFVGs5d0xVeDZSWGxEZFVKR2QxOVBPR1UxY0RCUU9Fd3hhRVpOTVhWYU5IRllaRVV0YlZsWlpqbFZlSFUxZG05dlgxSndZWGhzYldkQ0xUVmhhV0pKT0U1ZmNEUlVOMjFhUzE5aVpVRmFVR1J5VGpjNVZqaE9iMU41ZUhaWlVUQTJUMlpWTlhaNFRHTmhaSEIwYmtkUWMyVlliV0kzYUVkNFJtWm5NR1V3UlMxMUxUQnFYekpyV0hneGMyOXJMVEJDU1ZwQ2VHWlFjRFZKYzFCbldEQkNRVzlSUVZSYVNtTm5jWGN0Ym5wcFJXTjRiazlLT1VGc2EyczRkVzl6VTBseVZ6bHFhMk5HY0RWcmMwVnpXa05DZUVnMGMwdHVaSGw1VUVNMFNXNUdNVlYwVG5wbE1qQjRVelpFU0ZZNWVuVjFjazUxWVcwMVFuUk9NSEJzTUdwS1JVdGZNUzFuVUZOamRIRk5iSEJaTFVSU1RYVnZaMWxGWlZCVGRsZDRaMmhGY214YVUwRlBVRmRaUkhOSVUw

'eyJmYyI6IkVyY0VDdmNEUVVwSE9VcHJVRWR2T0dkamVVOXVNRUZ3TUZsQk5sRnNOMmM0YVd4M1UwSlhMWGhRZFV4VWFYZDBSVkUyVkRKcU1qZHlNMHRNWDNjdFdrRTFkR2xHVkRkQ1oyWmtUbmt5VjA5aFZWODBiSGxMVVRZNFFsVnVSRkppV0RaelJFUjBUVWh1YjI5RlNuTkJTM0pvT0hjMVZYcE9ibVk0WDJoM1dGVTFlamR5VlhkVExVaEtMV1ZtZVRkRVNWRTRjVTVoVXkxM2RHWmhZemt6TjNGV2JtMURaRlYwV1ZaMFFYTnZVWGt0WmpWTGVIVldTMk5yU1dweVZUTnlaSFppYXpsNFRVTk9NRlZoTjI5TFV6bDFWVm94ZVd0M1NGOXBkbXh4TW5ob2JVdG9OR2d4VlVjNFlsZG5ORloxWlhwQ1VGOVhZaTB6TFZWaVpWVnpVa0Z6WlY4Mll6bHBhRTFoVVRrNE9HMVBNRzR5VEhsc1EwVjFURzVKWjBOS2FVdEZTRXBXVEhaNE5UUkpWRWRSWDJsbldFVXhiMnRNVmtSSWNuWkRXRWRtU0VwNVFuSkpibFpIV1dwS1pWTmhhRE51VEdJMGN6TmtXRTV6TURjd04zQkRVME56UVZnMlZuSlNVVWhVVlY5SlEyWm5RbTU1ZWtSMGNqRnRhblpsYlRoc2NrVkdTMXB0VFRGSlRUUmZiMWszYVZrNVducEZhWEpMUkVreVFVRnNZa013UjJsdVJVOU1VbVExZERSdVFtWldlbTVVUjIxTFpFZzNMWHB6ZEhsVmVYbHlia055WnpkcFUzaE5ibnAwV1ZCRlZrcHBVbkZKVkVSSWJFdFlUVVJEU1Mwd1lqVnpSRVpLZW5WeFZXVmhSazk2U1ZKNFVYQlBSVkVTRjJOR1ZWWmFPRWRRVUVzMmJYQjBVVkF5TkMxU2RVRk5HaUpCUmxoeVJXTnhkVmxJV1VFMmEyTnRMV3cwUVR

In [50]:
for result in result_dict['jobs_results']:
    print(result["title"], result["company_name"])

Machine Learning Engineer III Chewy
Machine Learning/AI Engineer Keysight Technologies, Inc.
Principal Machine Learning Engineer - Central AI Atlassian
Senior Machine Learning Engineer, Prescient LLMs Genentech
Lead Machine Learning Engineer RemoteWorker CA
Senior Software Engineer - Machine Learning FIS Global
Scientific Programming and Machine Learning Engineer Leidos
Staff Machine Learning Engineer Intuit
Principal Machine Learning Software Engineer Advanced Micro Devices, Inc
Senior Software Engineer,  Machine Learning Infrastructure Thumbtack


## Iterate over job titles

These titles reflect a wide range of roles that leverage data science and machine learning skills in various industries and specialties.

- Data Scientist
- Machine Learning Engineer
- Artificial Intelligence Specialist
- Data Analyst
- Business Intelligence Analyst
- Research Scientist (AI/ML)
- Deep Learning Engineer
- NLP Engineer (Natural Language Processing)
- Computer Vision Engineer
- Data Engineer
- Applied Scientist
- Quantitative Analyst (Quant)
- Predictive Modeler
- AI Solutions Architect
- Statistician
- Big Data Engineer
- Data Science Consultant
- Automation Engineer
- Analytics Manager
- Decision Scientist
- Operations Research Analyst
- Robotics Engineer
- Bioinformatics Data Scientist
- Healthcare Data Analyst
- Financial Data Scientist
- Customer Insights Analyst
- Marketing Data Analyst
- Data Strategy Manager
- Cloud AI Engineer
- Computational Scientist
- Fraud Detection Specialist
- Risk Analyst
- Data Architect
- Algorithm Engineer

For each keyword, do three searches using pagination. This will yield around 30 jobs per keyword (assuming at least 30 jobs exist for that keyword). Save each search's results to a file.

Note: just to be safe, wait one second between requests, e.g. using `time.sleep(1)`.
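
The loop structure described above might look like the sketch below. `crawl_titles` and its `search_page` argument are hypothetical: `search_page(title, token)` is assumed to run one search, save the result, and return the next page token (or `None` when there are no more pages). The demo uses a stub and a zero delay so that it makes no API calls.

```python
import time


def crawl_titles(job_titles, search_page, pages_per_title=3, delay=1.0):
    """Run up to pages_per_title paginated searches per title; return the request count."""
    n_requests = 0
    for title in job_titles:
        token = None  # each title starts from its first page
        for _ in range(pages_per_title):
            token = search_page(title, token)
            n_requests += 1
            time.sleep(delay)  # pause between requests
            if token is None:  # no further pages for this title
                break
    return n_requests


# Dry run with a stub that always reports another page
calls = []
def stub_page(title, token):
    calls.append((title, token))
    return "next-token"

print(crawl_titles(["Data Scientist", "Data Analyst"], stub_page, delay=0))  # 6
```

Counting requests in a dry run like this is a cheap way to confirm the loop will stay within the search budget before pointing it at the real API.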

In [35]:
job_titles = [
    "Data Scientist",
    "Machine Learning Engineer",
    "Artificial Intelligence Specialist",
    "Data Analyst",
    "Business Intelligence Analyst",
    "Research Scientist (AI-ML)",
    "Deep Learning Engineer",
    "NLP Engineer (Natural Language Processing)",
    "Computer Vision Engineer",
    "Data Engineer",
    "Applied Scientist",
    "Quantitative Analyst (Quant)",
    "AI Solutions Architect",
    "Statistician",
    "Big Data Engineer",
    "Data Science Consultant",
    "Automation Engineer",
    "Analytics Manager",
    "Operations Research Analyst",
    "Robotics Engineer",
    "Bioinformatics Data Scientist",
    "Financial Data Scientist",
    "Customer Insights Analyst",
    "Marketing Data Analyst",
    "Data Strategy Manager",
    "Cloud AI Engineer",
    "Computational Scientist",
    "Fraud Detection Specialist",
    "Risk Analyst",
    "Data Architect"
]

# print(len(job_titles)*3)

90


Now insert code to iterate over the job titles, and perform the searches.

Be very careful: this needs to be 100% correct before you run it, otherwise you will burn through your free searches.

I would recommend doing just one iteration of the loop as a trial run. If that looks good, do the next iteration and carefully check the results. If everything still looks good, do the remaining 28 iterations.

Note: sometimes pagination will return fewer than 10 results, so you may end up with slightly fewer than 30 results per keyword, e.g. 25 to 30.

Remember to clean the job titles to remove characters such as spaces, `/`, or `()`.
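
One way to do this cleaning, sketched under the assumption that lowercase, hyphen-separated names are acceptable for your output files (the helper name `clean_title` is hypothetical):

```python
import re


def clean_title(title):
    """Lowercase a title and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


print(clean_title("Research Scientist (AI/ML)"))
# research-scientist-ai-ml
print(clean_title("NLP Engineer (Natural Language Processing)"))
# nlp-engineer-natural-language-processing
```

Use the cleaned string only for filenames; the original title is still what should be sent as the search query.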

In [60]:
# INSERT CODE HERE

------------------------
Data Engineer
SEARCH- 0
Senior Data Engineer : The MITRE Corporation
Data Scientist (Big Data Engineer) : Jobs via Dice
Databricks Data Engineer (Remote) : Loginsoft Consulting LLC
Senior Data Engineer : Dycom Industries, Inc.
Data Engineer : Spalding Consulting, Inc.
Data Engineer - Data Engineer I : INSPYR Solutions
Senior Data Engineer : Capital One
Data engineer : Maximus
Lead Data Engineer : Raymond James
Senior Data Engineer - Hybrid : Jobs via eFinancialCareers
SEARCH- 1
Data Engineer I : Tulane University
Data Infrastructure Engineer : Anrok
Lead Data Engineer : CliftonLarsonAllen LLP
Data Analytics Engineer : Under Armour, Inc.
Data Engineer : Fisher Investments
Reporting Data Engineer/BI : Motion Recruitment
Data Engineer : K2 Partnering
Data Engineer : Robert Half
Data Center Engineer (IT System Analyst/Engineer) : Oregon Health & Science University
Data Engineer I/II : Atomica
SEARCH- 2
Sr Adobe Analytics Engineer & Analyst : Citizens Bank, N.A.
Dat