# Web Scraping - Indeed.com
General steps for Web Scraping
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content, using tags and class names to locate the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
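
Step 1 can be partly automated with the standard library's `urllib.robotparser`. The sketch below parses a small hand-written robots.txt so that it runs offline. The rules and the `my-scraper` user agent are illustrative assumptions, not Indeed's actual policy. For a live site, `set_url(...)` followed by `read()` would fetch the real file. A robots.txt check does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not Indeed's real robots.txt).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" is a hypothetical user agent name.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```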

### Import Dependencies 

In [25]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By

### Path to webdriver (Firefox, Chrome) 

In [26]:
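# Selenium 4.6+ locates a matching driver automatically (Selenium Manager),
# so webdriver.Chrome() needs no explicit path.
# To use a local driver binary instead, pass its path through a Service object;
# the executable_path argument was deprecated in Selenium 4.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
# Linux
driver_path = r"./drivers/geckodriver"
# geckodriver is the Firefox driver:
# from selenium.webdriver.firefox.service import Service
# driver = webdriver.Firefox(service=Service(driver_path))
driver = webdriver.Chrome()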

### Define position and location 

In [27]:
## Enter a job position
position = "data analyst"
## Enter a location (City, State or Zip or remote)
locations = "Canada"

def get_url(position, location):
    url_template = "https://ca.indeed.com/jobs?q={}&l={}"
    url = url_template.format(position, location)
    return url

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
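
The position string contains a space and a location such as "Toronto, ON" contains a comma, so the query values may need percent-encoding. A minimal sketch of one way to do this with `urllib.parse.urlencode` follows. `build_search_url` and `BASE_URL` are hypothetical names introduced here, and the optional `start` offset mirrors the `&start=` parameter appended in the scraping loop.

```python
from urllib.parse import urlencode

BASE_URL = "https://ca.indeed.com/jobs"  # same host and path as url_template


def build_search_url(position: str, location: str, start: int = 0) -> str:
    """Build a search URL with percent-encoded query parameters."""
    params = {"q": position, "l": location}
    if start:
        params["start"] = start  # pagination offset
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("data analyst", "Canada"))
# https://ca.indeed.com/jobs?q=data+analyst&l=Canada
print(build_search_url("data analyst", "Toronto, ON", start=10))
# https://ca.indeed.com/jobs?q=data+analyst&l=Toronto%2C+ON&start=10
```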

### Scrape job postings

In [28]:
## Number of postings to scrape. Result pages are requested in steps of 10;
## a page can hold more than 10 cards, so the final row count may exceed this.
postings = 200

jn = 0  # running count of scraped job cards
driver.implicitly_wait(3)  # session-wide setting, applies to every find_element(s) call
for i in range(0, postings, 10):
    driver.get(url + "&start=" + str(i))

    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        jn += 1

        # The first anchor in the card links to the full posting
        anchors = job.find_elements(By.TAG_NAME, "a")
        links = anchors[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        company = soup.select('.companyName')[0].get_text().strip()
        location = soup.select('.companyLocation')[0].get_text().strip()

        # Optional fields: select() returns an empty list when the element is
        # missing, so indexing it raises IndexError
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''

        # Append one row per job card
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn, title))

Job number    1 added - Senior Data Analyst
Job number    2 added - Business Analyst
Job number    3 added - Business Analyst
Job number    4 added - REMOTE Project-based roles - Data Analyst
Job number    5 added - Credit Risk Analyst - Korean*
Job number    6 added - Data Analytics and Reporting Analyst
Job number    7 added - Data analyst student
Job number    8 added - Business Analyst
Job number    9 added - Data Analyst Intern
Job number   10 added - Junior Data Support Analyst
Job number   11 added - Data Analyst
Job number   12 added - HR Systems Business Analyst - HYBRID
Job number   13 added - Data Analyst
Job number   14 added - Climate Data Analyst
Job number   15 added - Data Operations Analyst (Remote)
Job number   16 added - Data Analyst
Job number   17 added - Data Analyst
Job number   18 added - Data Analyst, Competitive Intelligence
Job number   19 added - Junior Business Analyst (remote)
Job number   20 added - Data Analyst - Data & Analytics
Job number   21 added - 
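
The repeated `try`/`except` blocks in the loop can be folded into one helper built on BeautifulSoup's `select_one`, which returns `None` when nothing matches. The sketch below is one possible refactoring, not the notebook's original method. `text_or_default` is a hypothetical helper and `SAMPLE_CARD` is made-up markup that only reuses the class names from the loop. Rows are collected in a plain list so that a single `pd.DataFrame(rows)` call after the loop can replace the per-row `pd.concat`.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default="NaN"):
    """Return the stripped text of the first match for selector, or default."""
    node = soup.select_one(selector)
    return node.get_text().strip() if node is not None else default


# Hypothetical card markup, reduced to the classes used in the scraping loop.
SAMPLE_CARD = """
<h2 class="jobTitle"><a href="/rc/clk?jk=abc123">Data Analyst</a></h2>
<span class="companyName">Example Corp</span>
<div class="companyLocation">Toronto, ON</div>
"""

soup = BeautifulSoup(SAMPLE_CARD, "html.parser")

rows = []  # one dict per job card
rows.append({
    "Title": text_or_default(soup, ".jobTitle"),
    "Company": text_or_default(soup, ".companyName"),
    "Location": text_or_default(soup, ".companyLocation"),
    "Rating": text_or_default(soup, ".ratingNumber"),             # missing -> 'NaN'
    "Salary": text_or_default(soup, ".salary-snippet-container"),  # missing -> 'NaN'
    "Description": text_or_default(soup, ".job-snippet", default=""),
})
print(rows[0])
```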

In [29]:
driver.quit()

### Scrape full job descriptions

In [30]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [31]:
import random
import time

In [32]:
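driver = webdriver.Chrome()
descriptions = []
for link in Links_list:
    driver.get(link)
    # Implicit wait: how long find_element keeps polling before giving up
    driver.implicitly_wait(random.randint(3, 8))
    try:
        jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
        descriptions.append(jd)
    except Exception:
        # Keep the list aligned with the dataframe rows when no description is found
        descriptions.append('n/a')
    # Random pause between page loads to limit the request rate
    time.sleep(random.randint(5, 10))

# Full posting text; 'Description' (singular) holds the short search-result snippet
dataframe['Descriptions'] = descriptions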

In [33]:
driver.quit()

### Save results

In [34]:
# Save the dataframe to a CSV file named <date>_<position>_<location>.csv
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [35]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Senior Data Analyst,Spinify,"Toronto, ON",,EmployerActive 4 days ago,"$90,000–$110,000 a year","3+ years of experience in data analysis, prefe...",https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,About us\nSpinify is a dynamic and innovative ...
1,Business Analyst,Eperformance Inc.,"Hybrid remote in Ottawa, ON",,EmployerActive 2 days ago,"$60,000–$80,000 a year",Design business processes and data flows betwe...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,Eperformance Inc. is a solutions and services ...
2,Business Analyst,Russell Hendrix Foodservice Equipment,+1 locationRemote,2.6,PostedPosted 30+ days ago,,You have strong analytical skills with ability...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,Russell Hendrix is looking to add a Business A...
3,REMOTE Project-based roles - Data Analyst,Deloitte,"Remote in Toronto, ON",3.9,PostedPosted 17 days ago,,"In addition, for this particular data analyst ...",https://ca.indeed.com/rc/clk?jk=ad5ed76a4a1a89...,Job Type: Independent Contractor\nReference co...
4,Credit Risk Analyst - Korean*,KEB Hana Bank Canada,"North York, ON",3.1,PostedPosted 1 day ago,,Reviews anomalies of the data outputs indicate...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,Job Title: Credit Risk Analyst\nDept./Br.: Cre...
...,...,...,...,...,...,...,...,...,...
295,Business Analyst with Regulatory,Software International,"Toronto, ON",,PostedPosted 30+ days ago,$60 an hour,Experience with Finance Reporting projects wit...,https://ca.indeed.com/rc/clk?jk=65617b01fb8ee0...,Our company Software International (SI) suppli...
296,Information Analyst,Canadian Red Cross,Remote,3.9,PostedPosted 30+ days ago,"$61,382–$78,216 a year",Ensure system and data integrity.\nExpertise i...,https://ca.indeed.com/rc/clk?jk=e69da81cb3e0c5...,Title: Information Analyst\nLocation: Remote\n...
297,"Senior Data Analyst, Foundry",Providence Healthcare,"Vancouver, BC",4.0,PostedPosted 1 day ago,,Oversees data analytics requests and reporting...,https://ca.indeed.com/rc/clk?jk=d0fd9e23936c96...,Article Flag: Mandatory Vaccination Please Not...
298,"Trade Desk Support Analyst, Summer 2023 (Co-op...",BMO Financial Group,"Toronto, ON",3.8,PostedPosted 9 days ago,,Analyzes data and information to provide insig...,https://ca.indeed.com/rc/clk?jk=362cdf0df92369...,"100 King Street West Toronto Ontario,M5X 1A1\n..."
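
The `Date` column in the output above carries doubled prefixes such as `PostedPosted` and `EmployerActive`. The source does not say why. One plausible explanation is a hidden label inside the `.date` element whose text `get_text()` concatenates with the visible text. A minimal cleanup sketch under that assumption follows. `clean_date` is a hypothetical helper, and it only handles the two patterns visible in the output.

```python
import re

# Assumption: a leading "Posted" or "Employer" that is immediately followed by
# "Posted" or "Active" is a stray label and can be dropped.
_STRAY_PREFIX = re.compile(r"^(?:Posted|Employer)(?=Posted|Active)")


def clean_date(raw: str) -> str:
    """Remove the stray leading label from a scraped date string."""
    return _STRAY_PREFIX.sub("", raw).strip()


for raw in ["PostedPosted 30+ days ago", "EmployerActive 4 days ago", "NaN"]:
    print(repr(raw), "->", repr(clean_date(raw)))
# 'PostedPosted 30+ days ago' -> 'Posted 30+ days ago'
# 'EmployerActive 4 days ago' -> 'Active 4 days ago'
# 'NaN' -> 'NaN'
```

Applied to the scraped table, this could be written as `dataframe["Date"] = dataframe["Date"].map(clean_date)`.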
