<a href="https://colab.research.google.com/github/ayundina/job_posts_analysis/blob/main/scrape_job_posts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scrape Google job search for job posts**

In [1]:
%%capture
!pip install google-search-results

SerpApi requires an API key, which can be acquired by registering and generating a key on [their website](https://serpapi.com/).

In [30]:
from serpapi import GoogleSearch

GoogleSearch.SERP_API_KEY = "insert_api_key"

def get_search_results(params: dict) -> dict:
  client = GoogleSearch(params)
  response = client.get_dict()
  return response
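
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper `load_api_key` and the variable name `SERPAPI_API_KEY` are hypothetical names chosen for this example, not part of the SerpApi library.

```python
import os

def load_api_key(env_var: str = "SERPAPI_API_KEY") -> str:
  """Return the API key stored in an environment variable.

  The default variable name is an assumption made for this sketch.
  """
  key = os.environ.get(env_var, "")
  if not key:
    # Fail early with a clear message instead of sending an empty key
    raise RuntimeError(f"Set the {env_var} environment variable first")
  return key
```

With this helper, the assignment in the cell above would become `GoogleSearch.SERP_API_KEY = load_api_key()`.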

Build a dictionary of search parameters. The `start` parameter is the result offset used for pagination.

In [3]:
def get_params(query: str, page: int, eng: str, loc: str, lang: str) -> dict:
  params = {
      "q": query,
      "start": page,
      "engine": eng,
      "location": loc,
      "hl": lang
  }
  return params
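
To make the shape of the parameter dictionary concrete, here is a small usage sketch. It repeats `get_params` in condensed form so the snippet runs on its own. The argument values are the ones used later in this notebook, with an offset of 10 picked as an example.

```python
def get_params(query: str, page: int, eng: str, loc: str, lang: str) -> dict:
  # Same mapping as the function above, written as a single return
  return {"q": query, "start": page, "engine": eng, "location": loc, "hl": lang}

params = get_params("data science", 10, "google_jobs", "Netherlands", "en")
print(params)
# {'q': 'data science', 'start': 10, 'engine': 'google_jobs', 'location': 'Netherlands', 'hl': 'en'}
```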

Given a list of job titles and a list of page offsets, this function iterates over every title and collects the search results from each requested page.

In [12]:
def scrape_google_jobs(titles: list, pages: list) -> list:
  data = []

  for job_title in titles:
    for page in pages:
      params = get_params(job_title, page, "google_jobs", "Netherlands", "en")
      response = get_search_results(params)
      # "jobs_results" may be absent (for example, when a page has no results),
      # so fall back to an empty list instead of failing on None
      data.extend(response.get("jobs_results") or [])
  return data
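
The nested loop can be tried without spending API credits by swapping the real request for a stub. The sketch below is an illustration only: `collect_jobs` and `fake_fetch` are hypothetical names, and the stub assumes that a response without results simply lacks the `"jobs_results"` key.

```python
from typing import Callable

def collect_jobs(titles: list, pages: list, fetch: Callable[[str, int], dict]) -> list:
  """Call fetch for every (title, page) pair and merge the job lists."""
  data = []
  for job_title in titles:
    for page in pages:
      response = fetch(job_title, page)
      # Skip responses that carry no "jobs_results" key
      data.extend(response.get("jobs_results") or [])
  return data

def fake_fetch(title: str, page: int) -> dict:
  # Stub: pretend only the first page (offset 0) has a result
  if page == 0:
    return {"jobs_results": [{"title": f"{title} engineer"}]}
  return {"error": "no results"}

jobs = collect_jobs(["data science", "machine learning"], [0, 10], fake_fetch)
print(len(jobs))  # 2
```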

In [25]:
import pandas as pd

def filter_entries(d: list, selection: list) -> pd.DataFrame:
  # d is the list of job dictionaries returned by scrape_google_jobs
  df = pd.DataFrame(d)
  df = df[selection]
  print(f"before dropping duplicates - {df.shape[0]}")
  df = df.drop_duplicates()
  print(f"after dropping duplicates - {df.shape[0]}")
  return df

def save_to_file(df: pd.DataFrame, path: str) -> None:
  # One JSON object per line (JSON Lines)
  df.to_json(path, orient="records", lines=True)

def read_from_file(path: str) -> pd.DataFrame:
  df = pd.read_json(path, orient="records", lines=True)
  return df
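
Column selection and deduplication interact: two postings that differ only in a column that is not selected become identical rows and collapse into one. The toy records below are made up for this sketch and are not scraped data.

```python
import pandas as pd

# Made-up records: the first two differ only in the "via" column
records = [
  {"title": "Data Scientist", "company_name": "A", "via": "via A"},
  {"title": "Data Scientist", "company_name": "A", "via": "via LinkedIn"},
  {"title": "ML Engineer", "company_name": "B", "via": "via B"},
]

df = pd.DataFrame(records)[["title", "company_name"]]
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2
```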

In [None]:
job_titles = ["data science", "machine learning", "artificial intelligence"]
pages = [0, 10, 20, 30, 40, 50]

jobs = scrape_google_jobs(job_titles, pages)
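
The page list above steps in tens, which suggests that each page holds up to 10 postings. Under that assumption (it is not stated in the notebook), the offsets and the number of API requests can be derived rather than typed by hand. The names `results_per_page` and `num_pages` are introduced only for this sketch.

```python
job_titles = ["data science", "machine learning", "artificial intelligence"]
results_per_page = 10  # assumption: inferred from the step between offsets
num_pages = 6

# Offsets 0, 10, ..., 50, one per page
pages = [i * results_per_page for i in range(num_pages)]
num_requests = len(job_titles) * len(pages)

print(pages)         # [0, 10, 20, 30, 40, 50]
print(num_requests)  # 18
```

If every request returns 10 postings, 18 requests give 180 rows, which is consistent with the row count printed before duplicates are dropped.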

In [31]:
print(jobs[0])

{'title': 'Enterprise Data Scientist', 'company_name': 'FrieslandCampina', 'location': '  Amersfoort, Netherlands   ', 'via': 'via FrieslandCampina', 'description': "• Work with stakeholders (supply chain experts, factories, experts on shopfloor, etc) to understand business requirements and data needs.\n• Collect, cleanse, and validate supply chain (time series) data to ensure data accuracy and completeness.\n• Analyze large and complex data sets to identify patterns, trends, and insights related to the identified improvement areas in our supply chain\n• Work with Data and Analytics teams to implement data-driven solutions\n• Develop models and algorithms to improve our performance in the End-to-end Supply Chain\n• Converting standard definitions and logic and helping build the programmer's language\n• Communicate findings and insights to stakeholders in a clear and concise manner, and provide recommendations for action\n• Support Loss analyses in Our Way of Working\n• Stay up-to-date 

In [27]:
column_selection = ["title", "company_name", "location", "description"]

jobs_df = filter_entries(jobs, column_selection)
jobs_df.head(3)

before dropping duplicates - 180
after dropping duplicates - 156


Unnamed: 0,title,company_name,location,description
0,Enterprise Data Scientist,FrieslandCampina,"Amersfoort, Netherlands",• Work with stakeholders (supply chain experts...
1,Data Scientist,felyx,"Amsterdam, Netherlands",Company Description\n\nWith the intensifying t...
2,Healthcare Data Science Specialist,EMA,"Noordwijk, Netherlands",Job grade: AD06\n\nInternal/Interagency job gr...


The DataFrame is saved as a JSON Lines file inside the "jobs" folder on Google Drive. Connect to Google Drive before running the next cell.

In [28]:
import os

folder = "/content/drive/MyDrive/jobs"
file_name = "/data_science_jobs_01.json"

# Create the folder if it does not exist yet
os.makedirs(folder, exist_ok=True)

save_to_file(jobs_df, f"{folder}{file_name}")

Check that the DataFrame was saved correctly and that the data can be read back.

In [29]:
df = read_from_file(f"{folder}{file_name}")
df.head(3)

Unnamed: 0,title,company_name,location,description
0,Enterprise Data Scientist,FrieslandCampina,"Amersfoort, Netherlands",• Work with stakeholders (supply chain experts...
1,Data Scientist,felyx,"Amsterdam, Netherlands",Company Description\n\nWith the intensifying t...
2,Healthcare Data Science Specialist,EMA,"Noordwijk, Netherlands",Job grade: AD06\n\nInternal/Interagency job gr...
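
Comparing the first rows by eye is a reasonable smoke test. A stricter check, sketched below on made-up data, writes a small DataFrame to a temporary file and compares the records after reading it back. The temporary directory stands in for the Google Drive folder so the snippet can run anywhere.

```python
import os
import tempfile
import pandas as pd

# Made-up rows, including a newline and a bullet character as in real descriptions
original = pd.DataFrame([
  {"title": "Data Scientist", "description": "Line one\nLine two"},
  {"title": "ML Engineer", "description": "• Build models"},
])

with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, "jobs.json")
  original.to_json(path, orient="records", lines=True)
  restored = pd.read_json(path, orient="records", lines=True)

# Compare row by row as plain dictionaries
same = original.to_dict("records") == restored.to_dict("records")
print(same)
```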
