# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping (terms of service, robots.txt)
2. Obtain the source code (HTML) of the page by using its URL
3. Download the page content
4. Parse the content using the tags and class names of the elements of interest
5. Extract relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
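
Step 1 can be partly automated. The sketch below uses the standard-library `urllib.robotparser` to test a URL against a set of robots.txt rules. The rules shown are a made-up sample (not Indeed's actual file) so the example runs offline, and `is_allowed` is a hypothetical helper name. For a live site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`. A robots.txt check is only one signal; the site's terms of service still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/jobs?q=data+scientist"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report"))
```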

### Import Dependencies 

In [4]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
from selenium import webdriver

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [5]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
# driver_path = "./programs/geckodriver.exe"
# driver_path = r"C:\programs\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"
driver_path = r"./drivers/geckodriver"

# Firefox capabilities passed to arsenic's Firefox(**options) in the cells below.
# This must be defined, otherwise those cells raise a NameError.
options = {
    'moz:firefoxOptions': {
        # Run headless; remove this entry to watch the browser
        'args': ['-headless'],
        'log': {'level': 'warn'},
        # Needed for Windows / non-default Firefox install
        # 'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
    }
}

# The scraping cells open their own arsenic sessions, so no Selenium driver
# is required. Uncomment one line to open a browser manually for debugging.
# driver = webdriver.Firefox(executable_path=driver_path)
# driver = webdriver.Chrome(executable_path=driver_path)
# driver = webdriver.Chrome()

### Define position and location 

In [10]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "remote"

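from urllib.parse import quote_plus

def get_url(position, location):
    # URL-encode the values so spaces and commas are valid in the query string
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url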

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
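
As an alternative way to build the search URL, the sketch below uses `urllib.parse.urlencode`, which encodes all query parameters in one call. The helper name `build_search_url` is hypothetical, and the `start` offset is assumed to correspond to the `&start=` parameter that the scraping loop in this notebook appends for pagination.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(position: str, location: str, start: int = 0) -> str:
    """Build a search URL with URL-encoded query parameters.

    `start` is the result offset; it is omitted for the first page.
    """
    params = {"q": position, "l": location}
    if start:
        params["start"] = start
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("data scientist", "remote"))
print(build_search_url("data scientist", "Atlanta, GA", start=20))
```

Encoding matters most for locations such as "City, State", where the comma and space would otherwise end up unescaped in the URL.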

### Scrape job postings

In [11]:
## Number of postings to scrape
postings = 1000

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 10))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          # Skip cards that have no company name
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        # Optional fields: select() returns an empty list when the element
        # is absent, so indexing [0] raises IndexError
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''
            
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - AI Engineer
Job number    2 added - Principal Marketing Data Scientist
Job number    3 added - Senior Data Analyst - UHC M&R Part D - Minnetonka, MN or Remote USA
Job number    4 added - Machine Learning Engineer
Job number    5 added - Instructor, Data Science
Job number    6 added - Machine Learning Engineer - Platform
Job number    7 added - Applied ML Engineer
Job number    8 added - Associate Director, Clinical Data Science, Spotfire/R (Remote)
Job number    9 added - Senior Data Scientist (AI/ML)
Job number   10 added - Staff Data Scientist
Job number   11 added - Senior Data Scientist
Job number   12 added - Data Scientist
Job number   13 added - AI Engineer
Job number   14 added - DATA SCIENTISTS
Job number   15 added - Data and Technology Leader – IBM Watson Advertising
Job number   16 added - Senior Data Scientist/ Data Modeler ( Only USC/ GC/ EAD)
Job number   17 added - Senior Data Scientist (Remote Friendly)
Job number   18 added - Data Scientist
Jo

Job number  154 added - Training and Placement in Data Science and Business Analyst on W2
Job number  155 added - Data Visualization Specialist
Job number  156 added - Data Scientist - 3 positions - 100% remote
Job number  157 added - ML Engineer
Job number  158 added - Data Scientist
Job number  159 added - Data Scientist
Job number  160 added - Remote Data Scientist
Job number  161 added - Healthcare Data Scientist
Job number  162 added - Conversational AI chatbot Engineer
Job number  163 added - Senior Data Scientist | Bankrate
Job number  164 added - Machine Learning Engineer: Deep Reinforcement Learning
Job number  165 added - AI/ML Engineer
Job number  166 added - Data Scientist -- Principal
Job number  167 added - Data Scientist - Remote
Job number  168 added - Geospatial Machine Learning Software Develope
Job number  169 added - Software Engineer - AI Team
Job number  170 added - Senior Data Engineer (Python & ML)
Job number  171 added - Sr. Data Analyst
Job number  172 added -

Job number  305 added - Data Scientist
Job number  306 added - Data Scientist
Job number  307 added - ML Engineer
Job number  308 added - Senior Associate, Data Science (R-13261)
Job number  309 added - Engineer - Machine Learning
Job number  310 added - Senior Data Analyst / Jr. Data Engineer
Job number  311 added - Associate Machine Learning Engineer (Remote)
Job number  312 added - Director Data Science
Job number  313 added - Junior Machine Learning Engineer
Job number  314 added - Senior Data Analyst
Job number  315 added - VP, Data Science
Job number  316 added - Lead Data Scientist, Managing Consultant - Cleared
Job number  317 added - Data Scientist
Job number  318 added - Data Scientist 5 - Experimentation & Causal Inference
Job number  319 added - Data Scientist (Remote)
Job number  320 added - Data Scientist, East
Job number  321 added - Senior Data Scientist (Machine Learning)- Remote
Job number  322 added - Senior Data Analyst
Job number  323 added - Associate Data Scienti

Job number  468 added - Machine Learning Engineer
Job number  469 added - Machine Learning Engineer
Job number  470 added - Sr. Data Scientist - Capital Modeling
Job number  471 added - Staff Data Scientist - Retail
Job number  472 added - Associate Machine Learning Engineer (Remote)
Job number  473 added - Data Scientist
Job number  474 added - Senior Data Scientist (Data Analytics)
Job number  475 added - Senior/Principal Statistical Programmer (Remote)
Job number  476 added - Data Scientist
Job number  477 added - Machine Learning Engineer
Job number  478 added - Sr. Energy Storage Machine Learning Engineer - REMOTE
Job number  479 added - Director, Statistical Programming
Job number  480 added - Principal Statistical Programmer FSP (Remote)
Job number  481 added - Machine Learning Engineer Sr
Job number  482 added - Senior Machine Learning Software Engineer - Remote
Job number  483 added - Data Analyst
Job number  484 added - Remote Data Scientist / Senior Data Scientist
Job number

Job number  623 added - Sr. Customer Product Data Scientist
Job number  624 added - Senior Data Scientist
Job number  625 added - Senior Machine Learning Engineer
Job number  626 added - Senior Data Analyst
Job number  627 added - Sr. BSA/AML Data Analyst
Job number  628 added - Data Cloud Solutions Associate
Job number  629 added - Junior Data Scientist-Remote
Job number  630 added - Senior Data Scientist
Job number  631 added - AI MLOps Engineer
Job number  632 added - Data Scientist - Machine Learning Ops
Job number  633 added - Principal Statistician
Job number  634 added - AI Data Engineer
Job number  635 added - ML/NLP/Data Engineer
Job number  636 added - Data Scientist II / Senior Data Scientist
Job number  637 added - ArborCount Data Technician
Job number  638 added - Data Scientist (COVID Vaccine) (Remote)
Job number  639 added - Data Scientist
Job number  640 added - Senior Financial Data Analyst
Job number  641 added - Senior Data Analyst
Job number  642 added - Senior Mach

Job number  771 added - Climate Data Scientist
Job number  772 added - Senior Data Scientist - Risk
Job number  773 added - Data Analytics & Data Science Expert & Technical Coach - Part Time, Remote
Job number  774 added - Applied ML Engineer
Job number  775 added - Senior Data Analyst - Data Governance
Job number  776 added - Senior Data Analyst
Job number  777 added - Senior Machine Learning Engineer
Job number  778 added - Principal Data Scientist - Machine Learning (remote)
Job number  779 added - Senior Data Analyst (Remote)
Job number  780 added - Manager of Manufacturing Analytics/ Data Science (Remote)
Job number  781 added - Patent Agent / Patent Engineer / Software AI CS Computer
Job number  782 added - Sr Software Engineer (AI) - Telecommute
Job number  783 added - Senior Data Engineer with on-prem experience - 100% remote from anywhere is US
Job number  784 added - Data Scientist
Job number  785 added - Clinical Stat Programming Analyst
Job number  786 added - Sr. Data Scie
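
The scraping cell above splits the page offsets across `n` browser sessions and uses a shared lock and a shared set of IDs so that the same posting is not recorded twice. The sketch below isolates that pattern without a browser: `fake_fetch` is a hypothetical stand-in that returns made-up job IDs, and neighbouring pages overlap on purpose. It is a simplified illustration, not the notebook's actual scraper. Inside a notebook, where an event loop is already running, `await main()` would replace `asyncio.run(main())`.

```python
import asyncio

async def fake_fetch(page):
    """Stand-in for loading one results page; returns made-up job IDs."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    # Neighbouring pages overlap on purpose to exercise de-duplication.
    return [f"job-{page}", f"job-{page + 10}"]

async def worker(pages, seen, lock):
    """Collect the jobs from `pages` that no other worker has recorded yet."""
    kept = []
    for page in pages:
        for job_id in await fake_fetch(page):
            async with lock:
                if job_id in seen:
                    continue  # duplicate: already claimed by some worker
                seen.add(job_id)
            kept.append(job_id)
    return kept

async def main(n_workers=3):
    pages = list(range(0, 60, 10))
    # Give each worker its own slice of the page offsets.
    chunks = [pages[i::n_workers] for i in range(n_workers)]
    seen, lock = set(), asyncio.Lock()
    results = await asyncio.gather(*(worker(c, seen, lock) for c in chunks))
    return [job for chunk in results for job in chunk]

jobs = asyncio.run(main())
print(len(jobs), len(set(jobs)))
```

Because the check and the `seen.add` happen under the same lock, two workers cannot both decide that the same ID is new.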

### Scrape full job descriptions

In [16]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
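
Assigning to `dataframe['Descriptions']` needs exactly one entry per link, in the same order as `Links_list`. If one page lacks the expected element, the exception ends the whole task. The sketch below shows one way to keep the results aligned by recording `None` for a failed page. `fetch_all` and `demo_fetch` are hypothetical names, and `demo_fetch` simulates a failure instead of driving a browser.

```python
import asyncio
import random

async def fetch_all(urls, fetch, max_delay=0.0):
    """Fetch every URL in order; record None where a fetch fails.

    `fetch` is any coroutine function that takes a URL and returns text.
    """
    results = []
    for url in urls:
        try:
            results.append(await fetch(url))
        except Exception as exc:
            # Keep the slot so results stay aligned with the input order.
            print(f"skipped {url}: {exc}")
            results.append(None)
        # Random pause between requests, up to `max_delay` seconds.
        await asyncio.sleep(random.random() * max_delay)
    return results

async def demo_fetch(url):
    """Hypothetical fetcher that fails for one URL."""
    if "missing" in url:
        raise LookupError("no #jobDescriptionText element")
    return f"description of {url}"

texts = asyncio.run(fetch_all(["/rc/a", "/rc/missing", "/rc/b"], demo_fetch))
print(texts)
```

In the notebook, `fetch` would wrap the `session.get(...)` and `get_element('#jobDescriptionText')` calls, and `await fetch_all(...)` would replace `asyncio.run(...)`.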

### Save results

In [17]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)
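
The file name above is built directly from the search terms, so a location such as "Atlanta, GA" would put a comma and spaces into it. The sketch below is one possible way to normalize the parts first. `output_filename` is a hypothetical helper, and the slug rule (lowercase, runs of non-alphanumeric characters replaced by a hyphen) is an assumption, not something the notebook prescribes.

```python
import re
from datetime import date

def slug(text):
    """Lowercase `text` and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()

def output_filename(position, location, day=None):
    """Return '<YYYY-MM-DD>_<position>_<location>.csv' with safe characters."""
    day = day or date.today()
    return f"{day.isoformat()}_{slug(position)}_{slug(location)}.csv"

print(output_filename("data scientist", "Atlanta, GA", date(2022, 1, 5)))
```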

In [18]:
dataframe

| | Title | Company | Location | Rating | Date | Salary | Description | Links | Descriptions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Shaw Industries Group, Inc. | Remote | 3.8 | PostedPosted 14 days ago | | Partner with data scientists across the enterp... | /pagead/clk?mo=r&ad=-6NYlbfkN0CrvEjiI1EGEnRdQc... | We are looking for a data scientist to join ou... |
| 1 | Data Scientist, Marketing & Online (Remote) | The Home Depot | Remote in Atlanta, GA 30361 | 3.7 | PostedToday | $90,000 - $160,000 a year | 55% Solution Development - Design and develop ... | /pagead/clk?mo=r&ad=-6NYlbfkN0BAuTfAu5ThYozS55... | Position Purpose:\nThe Data Scientist is respo... |
| 2 | Analyst I, Data Science | Liberty Mutual Insurance | Remote | 3.6 | PostedPosted 9 days ago | $70,100 - $161,600 a year | Competencies typically acquired through a Mast... | /pagead/clk?mo=r&ad=-6NYlbfkN0D19kSVUiNzG2UWy1... | The Product Design and Modeling Department of ... |
| 3 | Data Scientist (Remote) | Yelp | Remote | 3.4 | PostedPosted 30+ days ago | $96,000 - $220,000 a year | Communicate key insights from analyses, experi... | /rc/clk?jk=0072004f2d3180a4&fccid=0e94073a1c93... | At Yelp, it’s our mission to connect people wi... |
| 4 | Data Scientist - NLP | Ursus, Inc. | Remote in Menlo Park, CA 94025 | 4.9 | PostedPosted 12 days ago | $40.00 - $48.65 an hour | " Apply knowledge of statistics, machine learn... | /pagead/clk?mo=r&ad=-6NYlbfkN0CT8vBT9H5mqECx2d... | JOB TITLE: Data Scientist - NLP\nLOCATION: Rem... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 830 | Director, Statistical Programming | Firma Clinical Research | Remote | 3.8 | PostedPosted 6 days ago | | Career Opportunity – Director Statistical Prog... | /pagead/clk?mo=r&ad=-6NYlbfkN0Ccr8WjSa4DbpNgRF... | Career Opportunity – Director Statistical Prog... |
| 831 | Principal Statistical Programmer FSP (Remote) | Labcorp | Remote in Wilmington, NC 28403+1 location | 3.4 | PostedPosted 6 days ago | | Hiring for Principal Statistical Programmer FS... | /pagead/clk?mo=r&ad=-6NYlbfkN0DPts0BZf5RUYLDzn... | Hiring for Principal Statistical Programmer FS... |
| 832 | Senior Machine Learning Software Engineer - Re... | Dropbox | Remote in Boston, MA 02108+1 location | 3.9 | PostedPosted 6 days ago | | Role Description Dropbox is looking for a Mach... | /pagead/clk?mo=r&ad=-6NYlbfkN0BXuQyu8a89IGjYOq... | Role Description Dropbox is looking for a Mach... |
| 833 | Remote Data Scientist / Senior Data Scientist | Direct Auto Insurance | Remote in Winston-Salem, NC 27103 | 3.0 | PostedPosted 30+ days ago | | **Remote work available** Job Summary: This ro... | /pagead/clk?mo=r&ad=-6NYlbfkN0ABn5GwwiAtE4UwcQ... | **Remote work available**\n\nJob Summary:\nThi... |
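
The `Date` values in the results carry a repeated "Posted" prefix (for example "PostedPosted 14 days ago" and "PostedToday"). The sketch below shows one way such labels could be turned into a number of days. It assumes only the label forms visible in the results shown here; `days_since_posted` is a hypothetical helper, and any other label returns `None`.

```python
import re
from typing import Optional

def days_since_posted(raw: str) -> Optional[int]:
    """Convert a scraped date label to a number of days, or None if unknown.

    "30+ days ago" is returned as 30, which is a lower bound.
    """
    # Drop one or more leading "Posted" prefixes, then normalize case.
    text = re.sub(r"^(Posted)+", "", raw).strip().lower()
    if text == "today":
        return 0
    match = re.match(r"(\d+)\+? days? ago", text)
    return int(match.group(1)) if match else None

for label in ["PostedPosted 14 days ago", "PostedToday", "PostedPosted 30+ days ago", "NaN"]:
    print(label, "->", days_since_posted(label))
```

With pandas this could be applied column-wise, for example `dataframe['Date'].map(days_since_posted)`.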
