# Web Scraping for Indeed.com Australia and Salary Prediction

### PART 1: Indeed Australia Website scrapping using Beautiful Soup

This project is a test of three major skills: collecting data by scraping a website, using natural language processing, and building a binary classifier.

Author : Ayesha Khatib

I have used Beautiful Soup Web scrapping technique to scrape Indeed Australia Job website and will be using Machine learning techniques to answer two questions :

- *Question 1 : Factors that impact Salary.*

- *Questions 2 : Factors that distinguish job category.*

### Import the required libraries

In [13]:
import requests
import bs4
import pandas as pd
import numpy as np
import time
from time import sleep
from bs4 import BeautifulSoup
import urllib
import urllib.request
import re

### 1. Examining the URL and Page structure

In [4]:
#URL = "https://au.indeed.com/jobs?q=data&l=Australia&fromage=last"

URL = "https://au.indeed.com/jobs?q=data+scientist&l=Australia"

#### URL Structure : 
1. DNS domain of the Indeed website <"https://au.indeed.com"> is the home page.
-
2. "q=" where the query starts and “what” field searching on the page, separating search terms with “+” (i.e. searching for “data+scientist” jobs)
-
3. "&l=" begins the string for city of interest, separating search terms with “+” if city is more than one word.
4. if salary is specified it will parse by commas in the salary figure and will be preceded by %24 and then the number before the first comma, it will then be broken by %2C and continue with the rest of the number (i.e. %2420%2C000 = $20,000)

In [5]:
#printing soup in a more structured tree format that makes for easier reading
page = requests.get(URL)

soup = BeautifulSoup(page.text, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="/s/ab2a39a/en_AU.js" type="text/javascript">
  </script>
  <link href="/s/970d98c/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://au.indeed.com/rss?q=data+scientist&amp;l=Australia" rel="alternate" title="Data Scientist Jobs in Australia" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+scientist&amp;l=Australia" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=data+scientist&amp;l=Australia" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
        window['closureReadyCallbacks'] = [];
    }

    function call_when_jsall_loaded(cb) {
        if (window['closureReady']) {
            cb();
        } else {
            window['closureReadyCallbacks'].push(cb);
        }
    }
  </script>
  <meta conte

### 2. Following individual functions for the required columns for the solving Q1 and Q2.

#### 2.1 Job Title : 
Fetching the job title information involved three steps:

1. pulling out all <div> tags with class including “row”
2. identifying <a> tags with attribute “data-tn-element”:”jobTitle”
3. for each of these <a> tags, find the value of attributes “title”

In [6]:
def extract_job_title_from_result(soup): 
    jobs = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
            jobs.append(a["title"])
    return(jobs)
        
extract_job_title_from_result(soup)

# Job title for 15 jobs on one page.

['Data Scientist - Analytics',
 'BioMind Data Scientist/Engineer (Relocate to Beijing)',
 'Junior Big Data Engineer',
 'Consultant, Data Scientist',
 'Statistical Scientist and Data Specialist',
 'Data Scientist',
 'Junior Data Scientist/ Machine Learning Engineer',
 'Data Scientist (As per award)',
 'Alibaba Cloud Data Scientist Melbourne',
 'EL 1 Data scientist',
 'Data Scientist',
 'Junior Data Scientist',
 'Data Scientist',
 'Laureate Data Scientist',
 'Data Scientist',
 'Data Analyst/Data Scientist']

#### 2.2 Job Description :

In [7]:
def extract_summary_from_result(soup): 
    summaries = []
    spans = soup.findAll('span', attrs={'class': 'summary'})
    for span in spans:
        summaries.append(span.text.strip())
    return(summaries)

    
extract_summary_from_result(soup)
# Job description for 15 jobs on one page.


['As a Data Scientist in our team, you will leverage your deep experience in statistics, machine learning and data analysis to drive significant improvements to...',
 'The candidate is a skilled data scientist who will be building deep learning models for medical image diagnosis applications such as segmentation, localisation...',
 'You are comfortable collaborating with engineers from other teams, product owners, business teams, and data analysts and data scientists....',
 'Reporting to the Manager, Delivery Analytics Experiments, the Data Scientist will collaborate with line of business stakeholders, leveraging NAB’s technology...',
 'Leading advanced quantitative analysis, surveys and presentation of data to key stakeholders. Working in conjunction with the Work Package Lead and related...',
 'Mentoring a team of junior data scientists to deliver innovative and impactful insights from data. Utilize exploratory data analysis techniques to understand...',
 'We also believe great data 

#### 2.3 Company :

In [8]:
def extract_company_from_result(soup): 
    companies = []
    for div in soup.find_all(name='div', attrs={"class":"row"}):
        company = div.find_all(name="span", attrs={"class":"company"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            sec_try = div.find_all(name='span', attrs={"class":"result-link-source"})
            for span in sec_try:
                companies.append(span.text.strip())
    return(companies)
 
extract_company_from_result(soup)

['Domain Group',
 'biomind.ai',
 'Xpand Group Pty Ltd',
 'National Australia Bank',
 'Deakin University',
 'Ambulance Victoria',
 'Intellify',
 'Victorian Government',
 'Alibaba',
 'Australian Taxation Office',
 'Dialog Information Technology',
 'The Eclair Group',
 'Deakin University',
 'Swinburne University of Technology',
 'ANZ Banking Group',
 'Mediterranean Shipping Company']

#### 2.4 Salary :

In [9]:
def extract_salary_from_result(soup): 
    salaries = []
    for div in soup.find_all(name="div", attrs={"class":"row"}):
        try:
            salaries.append(div.find("no-wrap").text)
        except:
            try:
                div_two = div.find(name="div", attrs={"class":"sjcl"})
                div_three = div_two.find("div")
                salaries.append(div_three.text.strip())
            except:
                salaries.append("Nothing_found")
    return(salaries)

extract_salary_from_result(soup)

['Domain Group\n\n\n9 reviews',
 'biomind.ai',
 'Xpand Group Pty Ltd',
 'National Australia Bank\n\n\n324 reviews',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'Nothing_found',
 'ANZ Banking Group\n\n\n905 reviews',
 'Mediterranean Shipping Company\n\n\n329 reviews']

In [10]:
#### 2.5 Putting everything together to scrape data from the first page of results :

In [None]:
max_resuts = 1000

jobs_df = pd.DataFrame({'title':titles,'company':company,'location':locations,'salary':salaries,'summary':summaries})

In [None]:
#Scrape data from the first page of results
#First get html for search page

url = 'https://au.indeed.com/jobs?q=data&l=Australia'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

#Then make a list of all the urls of the individual job listings
link = soup.findAll('a',{'data-tn-element':'jobTitle'})
urls = []
for i in range(len(link)):
    urls.append(link[i]['href'])
     
#Now iterate through this list to scrape relevant information from each page
for address in urls:
    sub_url = "https://au.indeed.com"+str(address)
    response2 = requests.get(sub_url)
    html2 = response2.text
    soup2 = BeautifulSoup(html2, 'lxml')
    
    title = soup2.find_all('h3',{'class':'jobsearch-JobInfoHeader-title'})
    [titles.append(title[i].text) for i in range(len(title))]

    
    for jobs in range(0, max_resuts,100):
        for fetch in soup2.findAll('div', {'class':' row'}):
            try: 
                company = fetch.find('span', {'class':'company'}).text.replace('\n', '')
            except:
                company = 'None'
            
            try:
                locations = fetch.find('span', {'class':'location'}).text.replace('\n', '')
            except:
                locations = 'None'
            
            try:
                salaries = fetch.find('span', {'class' : "no-wrap"}).text.replace('\n', '')
            except:
                salaries = 'None'
                    
            #summary = fetch.find('span', {'class':'summary'}).text.replace('\n', '')
                    
        summ = (soup2.findAll('span',{'class':'summary'}))
        [summaries.append(summ[i].text) for i in range(len(summ))]

In [63]:
type(jobs_df)

pandas.core.frame.DataFrame

**Step 2 : Preparing dataset consisting of the all job postings with details :**

In [64]:
jobs_df.head(5)

Unnamed: 0,Company,Job_Title,Summary,Location,Salary
0,The Eclair Group,Junior Data Scientist,Industry experience as a Data Anal...,Sydney NSW,"$70,000 - $90,000 a year"
1,,Junior Data Scientist,Opportunity to start your career i...,Sydney Central Business District NSW,
2,Intellify,Junior data scientist/machine learning engineer,We also believe great data science...,Sydney NSW,"$80,000 - $100,000 a year"
3,DataRobot,Customer Facing Data Scientist,Customer Facing Data Scientists wo...,Sydney NSW,
4,Freshwater Group,Data Scientist,The Data Scientist will:. Manage d...,Sydney NSW,


In [65]:
jobs_df.shape # (1260,5)

(1257, 5)

In [66]:
jobs_df.Summary[1]

'            Opportunity to start your career in Data Science Permanent role in Sydney CBD Be part of the brightest team I am currently looking for a few Junior Data...'

In [76]:
# Verify the job titles
jobs_df.Job_Title.unique()

# Findings : most are related to data but few are not the match we are looking for but will keep them for now.

array(['Junior Data Scientist',
       'Junior data scientist/machine learning engineer',
       'Customer Facing Data Scientist', 'Data Scientist',
       'Senior Data Scientist', 'Ikon | Data Scientist',
       'Hi-Freq Quantitative Analyst', 'Agile Business Analyst',
       'Data Scientist/ Data Engineer', 'Machine Learning Consultant',
       'Data Scientist/Analyst', 'Lead Data Scientist',
       'FMCG Consultant - Sydney',
       'Local Instructor - Data Science Immersive (Full-time Contrac...',
       'Quantitative Researcher- Machine Learning',
       'Enterprise Sales Account Executive - Australia',
       'Regional Marketing Manager, ANZ', 'Statistical Modeller',
       'Quant Trader - Futures and Options Prop Trading',
       'Marketing Automation Specialist',
       'Senior Software Engineer, Data Science & Analytics',
       'Business Development Manager',
       'Technical Solutions Engineer, Google Cloud Platform, Big Dat...',
       'Forward Deployed Solution Leader',
 

In [77]:
# Save the file after the Website scrapping.

jobs_df.to_csv('./job_posts_scrapped.csv',index=False, encoding='utf-8')