<a href="https://colab.research.google.com/github/ayundina/job_posts_analysis/blob/main/scrape_job_posts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scrape Google job search for job posts**

In [1]:
%%capture
!pip install google-search-results

SerpApi requires an API key, which can be obtained by registering and generating a key on [their website](https://serpapi.com/).

In [2]:
from serpapi import GoogleSearch

GoogleSearch.SERP_API_KEY = "YOUR_SERPAPI_KEY"  # replace with your own key; never share a real key in a notebook

def get_search_results(params: dict) -> dict:
  """Run a SerpApi search with the given parameters and return the response as a dict."""
  client = GoogleSearch(params)
  return client.get_dict()
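
Writing the key directly into a shared notebook exposes it to anyone who opens the file. A minimal sketch of one alternative, assuming the key has been stored in an environment variable beforehand; the variable name `SERPAPI_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this illustration and are not part of the `serpapi` package:

```python
import os

def load_api_key(env=None, name="SERPAPI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing."""
    # Fall back to the real environment when no mapping is passed in
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key

# Demonstration with an explicit mapping instead of the real environment
print(load_api_key({"SERPAPI_API_KEY": "demo-key"}))
```

The returned string could then be assigned to `GoogleSearch.SERP_API_KEY` instead of a literal.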

Build a dictionary of search parameters.

In [3]:
def get_params(query: str, page: int, eng: str, loc: str, lang: str) -> dict:
  params = {
      "q": query,
      "start": page,
      "engine": eng,
      "location": loc,
      "hl": lang
  }
  return params
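
To make the resulting structure concrete, the sketch below builds the parameters for one query at offset 10. The function is repeated in compact form so the snippet runs on its own, and the argument values mirror those used later in this notebook:

```python
def get_params(query: str, page: int, eng: str, loc: str, lang: str) -> dict:
    # Same mapping as above: "start" carries the result offset, "hl" the language
    return {"q": query, "start": page, "engine": eng, "location": loc, "hl": lang}

params = get_params("data scientist", 10, "google_jobs", "Netherlands", "en")
for key, value in params.items():
    print(f"{key}: {value}")
```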

`titles` is a list of job titles to search for, and `pages` is a list of result offsets to request. The function below iterates over the titles and collects the search results from every requested page.

In [4]:
def scrape_google_jobs(titles: list, pages: list) -> list:
  data = []

  for job_title in titles:
    for page in pages:
      params = get_params(job_title, page, "google_jobs", "Netherlands", "en")
      response = get_search_results(params)
      # "jobs_results" may be absent (e.g. no results for this page), so fall back to []
      jobs = response.get("jobs_results") or []
      data.extend(jobs)
  return data
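
The nested loop can be tried without spending API credits by passing in a stand-in for the search call. In this sketch `collect_jobs` and `fake_search` are hypothetical names, and the fake response only mimics the `jobs_results` key that the code above reads. It also shows one way to skip responses that carry no results:

```python
from typing import Callable

def collect_jobs(titles: list, pages: list, search: Callable[[str, int], dict]) -> list:
    """Gather job entries for every title/page pair, skipping empty responses."""
    data = []
    for title in titles:
        for page in pages:
            response = search(title, page)
            # .get() returns None when the key is absent, so fall back to []
            data.extend(response.get("jobs_results") or [])
    return data

def fake_search(title: str, page: int) -> dict:
    """Stand-in for the API call: one job at offset 0, nothing afterwards."""
    if page == 0:
        return {"jobs_results": [{"title": f"{title} (sample)"}]}
    return {"error": "no results"}

sample = collect_jobs(["data scientist", "machine learning"], [0, 10], fake_search)
print(len(sample))  # 2
```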

In [5]:
import pandas as pd

def filter_entries(d: list, selection: list) -> pd.DataFrame:
  df = pd.DataFrame(d)
  df = df[selection]
  print(f"before dropping duplicates - rows = {df.shape[0]}")
  df = df.drop_duplicates()
  print(f"after dropping duplicates - rows = {df.shape[0]}")
  return df

def save_to_file(df: pd.DataFrame, path: str) -> None:
  df.to_csv(path, index=False)

def read_from_file(path: str) -> pd.DataFrame:
  return pd.read_csv(path)
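
As a small illustration of what `filter_entries` does, the sketch below applies the same two steps (column selection, then `drop_duplicates`) to made-up records. The records and the `job_id` field are invented for the example; two of them differ only in a column that is not selected, so they collapse into one row:

```python
import pandas as pd

# Invented records: the first two differ only in "job_id"
records = [
    {"title": "Data Scientist", "company_name": "A", "job_id": "1"},
    {"title": "Data Scientist", "company_name": "A", "job_id": "2"},
    {"title": "ML Engineer", "company_name": "B", "job_id": "3"},
]

selected = pd.DataFrame(records)[["title", "company_name"]]
deduplicated = selected.drop_duplicates()
print(len(selected), len(deduplicated))  # 3 2
```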

In [20]:
job_titles = ["data scientist", "machine learning", "artificial intelligence"]
pages = [0, 10, 20, 30, 40, 50]

jobs = scrape_google_jobs(job_titles, pages)

In [21]:
print(f"Title: {jobs[0]['title']}")
print(f"Company name: {jobs[0]['company_name']}")
print(f"Location: {jobs[0]['location']}")
print(f"Description: {jobs[0]['description'][0:300]}...")

Title: Graduate Data Scientist
Company name: Optiver
Location:   Amsterdam, Netherlands   
Description: Can you solve this puzzle?

An ant leaves its anthill in order to forage for food. It moves with the speed of 10cm per second, but it doesn't know where to go, therefore every second it moves randomly 10cm directly north, south, east or west with equal probability.
• If the food is located on east-w...


In [22]:
column_selection = ["title", "company_name", "location", "description"]

jobs_df = filter_entries(jobs, column_selection)
jobs_df.head(3)

before dropping duplicates - rows = 180
after dropping duplicates - rows = 159


Unnamed: 0,title,company_name,location,description
0,Graduate Data Scientist,Optiver,"Amsterdam, Netherlands",Can you solve this puzzle?\n\nAn ant leaves it...
1,Process Data Scientist,FrieslandCampina,"Amersfoort, Netherlands",• Work with stakeholders (supply chain experts...
2,Data Scientist,felyx,"Amsterdam, Netherlands",Company Description\n\nWith the intensifying t...


The DataFrame is saved as a CSV file in the "jobs" folder on Google Drive. Connect to Google Drive before running the next cell.
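
In Colab, connecting to Drive is usually done with `drive.mount` from the `google.colab` package. A minimal sketch, assuming the mount point `/content/drive` that the path in the next cell relies on; the `mount_drive` wrapper is a hypothetical helper that simply does nothing outside Colab:

```python
def mount_drive(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only available inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True

mounted = mount_drive()
print("Drive mounted" if mounted else "Not running in Colab, skipping mount")
```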

In [23]:
import os

folder = "/content/drive/MyDrive/jobs"
file_name = "/data_science_jobs.csv"

# Create the folder if it does not exist yet
os.makedirs(folder, exist_ok=True)

save_to_file(jobs_df, f"{folder}{file_name}")

Check that the DataFrame was saved correctly and that the data can be read back.

In [24]:
df = read_from_file(f"{folder}{file_name}")
df.head(3)

Unnamed: 0,title,company_name,location,description
0,Graduate Data Scientist,Optiver,"Amsterdam, Netherlands",Can you solve this puzzle?\n\nAn ant leaves it...
1,Process Data Scientist,FrieslandCampina,"Amersfoort, Netherlands",• Work with stakeholders (supply chain experts...
2,Data Scientist,felyx,"Amsterdam, Netherlands",Company Description\n\nWith the intensifying t...
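
The visual check above can be complemented with a programmatic comparison. A sketch, assuming a temporary file instead of the Drive path and made-up rows; `pd.testing.assert_frame_equal` raises an `AssertionError` if the two frames differ:

```python
import os
import tempfile

import pandas as pd

# Invented rows standing in for the scraped job posts
original = pd.DataFrame(
    {"title": ["Data Scientist", "ML Engineer"], "company_name": ["A", "B"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "jobs.csv")
    original.to_csv(path, index=False)  # same call as save_to_file
    restored = pd.read_csv(path)        # same call as read_from_file

pd.testing.assert_frame_equal(original, restored)
print("round trip OK")
```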
