<h1> Complete Web Scraping Project </h1>

<h3>Importing modules/libraries </h3>

In [1]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd

<h3> Defining functions </h3>

Our first function builds a search URL from the user's input, so that we can send a request to it.

For example:

https://www.jobstreet.com.my/en/job-search/python-developer-jobs-in-selangor/

....java-developer-jobs-in-kuala-lumpur

Notice the pattern in the URL path: it is always <b>position</b>-jobs-in-<b>location</b>

In [2]:
# Generate a search URL for a job position and a job location

def get_url(position, location):
    # Replace spaces with hyphens so the values fit the URL pattern
    position = position.replace(' ', '-')
    location = location.replace(' ', '-')
    template = 'https://www.jobstreet.com.my/en/job-search/{}-jobs-in-{}/'
    # The first {} is filled with the position, the second {} with the location
    url = template.format(position, location)
    return url
    
    


In [3]:
url = get_url('java','kuala lumpur')
print(url)

https://www.jobstreet.com.my/en/job-search/java-jobs-in-kuala-lumpur/
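The URL pattern above can also be built a little more defensively. The sketch below is only an illustration: `slugify` and `build_search_url` are hypothetical names, and lowercasing the input is an assumption based on the example URLs, not something the site is known to require.

```python
from urllib.parse import quote

BASE_TEMPLATE = "https://www.jobstreet.com.my/en/job-search/{}-jobs-in-{}/"


def slugify(text):
    """Lowercase the text and join its words with single hyphens."""
    # str.split() with no argument trims the ends and collapses repeated spaces
    words = text.lower().split()
    # quote() percent-encodes any character that is not safe inside a URL path
    return quote("-".join(words))


def build_search_url(position, location):
    """Hypothetical variant of get_url that normalises its inputs first."""
    return BASE_TEMPLATE.format(slugify(position), slugify(location))


print(build_search_url("  Python   Developer ", "Kuala Lumpur"))
```

Compared with a plain `replace(' ', '-')`, this version does not produce double hyphens when the user types extra spaces.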


<h3>Checking status code </h3>

In [4]:
response = requests.get(url)
print(response.status_code)

200
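A status code of 200 means the request succeeded. As a small sketch of how status codes can be interpreted with the standard library only, the helpers below (`describe_status` and `is_success`, both hypothetical names) turn a numeric code into a readable label and check whether it falls in the 2xx range.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for a numeric HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"


def is_success(code):
    """True for any 2xx code, the range that signals a successful request."""
    return 200 <= code < 300


print(describe_status(200), is_success(200))
print(describe_status(404), is_success(404))
```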


<h3>Checking contents of the webpage</h3>

In [5]:
response.text

'<!DOCTYPE html>\n  <html lang="en">\n    <head>\n      <meta charset="UTF-8">\n      <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">\n      \n      <link rel="icon" href="/static/shared-web/favicon-4e1897dfd0901e8a3bf7e604d3a90636.ico">\n<link rel="apple-touch-icon" href="/static/shared-web/iphone-2a9b65f22fc35e35808fcc317eb63810.png">\n<link rel="apple-touch-icon" sizes="76x76" href="/static/shared-web/ipad-d50023448fe0126ad1da4390a4af7f72.png">\n<link rel="apple-touch-icon" sizes="120x120" href="/static/shared-web/iphoneRetina-e8d65115bab819c629d8265de1e94120.png">\n<link rel="apple-touch-icon" sizes="152x152" href="/static/shared-web/ipadRetina-a71dfaf93883a40d06c0c7b6a97fad99.png">\n<meta name="twitter:image" content="/static/shared-web/banner-0c2ac79883746c7700892a4915e53610.png">\n<meta name="twitter:card" content="summary">\n<meta name="twitter:site" content="@JobStreetMY">\n<meta property="og:image" content="/static/shared-web/banner-0c2a

<h3>Using BeautifulSoup to scrape Company Information</h3>

First we create a BeautifulSoup object.

The information we are scraping:
<ol>
    <li>Job name</li>
    <li>Company offering the job</li>
    <li>Location</li>
    <li>Salary</li>
    <li>Summary info</li>
    <li>Date the job was posted</li>
</ol>

In [6]:
soup = BeautifulSoup(response.text,'html.parser')

<h3>Searching for div tags having a specific class name</h3>

We are looking for div tags with this class name

<code>sx2jih0 zcydq852 zcydq842 zcydq872 zcydq862 zcydq82a zcydq832 zcydq8d2 zcydq8cq</code>

Time to use the BeautifulSoup object!

The call has the form <code>soup.find_all('TheTagYouWant','YourClass')</code>

In [7]:
cards = soup.find_all('div','sx2jih0 zcydq852 zcydq842 zcydq872 zcydq862 zcydq82a zcydq832 zcydq8d2 zcydq8cq')
# cards is a list of every div tag that matches the class name

print(len(cards)) # Number of job cards found on the page; we expect 30

30
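Passing the full class string to `find_all` matches the class attribute exactly as written, so a tag whose classes appear in a different order is skipped. The sketch below shows the difference on a made-up HTML snippet (the class names `card` and `featured` are invented for illustration), alongside a CSS selector via `select`, which ignores class order.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only, not taken from the real page
html = """
<div class="card featured"><span>First job</span></div>
<div class="featured card"><span>Second job</span></div>
<div class="other"><span>Not a job card</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches only the tag whose class attribute is exactly "card featured"
exact = soup.find_all("div", "card featured")

# A CSS selector matches any div that has both classes, in any order
by_css = soup.select("div.card.featured")

print(len(exact), len(by_css))
```

Either approach depends on class names that the site generates, so the selectors may need updating whenever the page markup changes.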


<h3>Printing the first card</h3>

In [8]:
print(cards[0])

<div class="sx2jih0 zcydq852 zcydq842 zcydq872 zcydq862 zcydq82a zcydq832 zcydq8d2 zcydq8cq"><div class="sx2jih0"><div class="sx2jih0 zcydq89i" data-automation="job-card-logo"><img alt="THREE LOGIC CONCEPTS SDN BHD's logo" class="sx2jih0 _1aG7R_0" src="https://image-service-cdn.seek.com.au/5df9fc6d8693c21731d0388d38ac124a09f38fac/ee4dce1061f3f616224767ad58cb2fc751b8d2dc"/></div><div class="sx2jih0 zcydq89i"><h1 class="sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc3 _18qlyvc8"><a class="_18qlyvcu _9tnmfh1 _18qlyvc2 sx2jih0 sx2jihb zcydq824" href="/en/job/backend-developer-4751972?jobId=jobstreet-my-job-4751972&amp;sectionRank=1&amp;token=0~9c3b9ffd-2e17-4381-92f7-6eddf8ed619b&amp;fr=SRP%20Job%20Listing" rel="nofollow noopener noreferrer" target="_top"><div class="sx2jih0 _2j8fZ_0 sIMFL_0 _1JtWu_0"><span class="sx2jih0">BackEnd Developer</span></div></a></h1><span class="sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc1 _18qlyvc8">THREE LOGIC CONCEPTS SDN BHD</span></div><span class="sx2jih0

<h3>Prototyping the model with single card</h3>

In [9]:
single_card = cards[0]

<h4>Fetching the job name</h4>

In [10]:
job_name = single_card.find('div','sx2jih0 _2j8fZ_0 sIMFL_0 _1JtWu_0')
print(job_name)
print(job_name.text)

<div class="sx2jih0 _2j8fZ_0 sIMFL_0 _1JtWu_0"><span class="sx2jih0">BackEnd Developer</span></div>
BackEnd Developer


How can our users know that the info we provide is genuine?

<b>Simple, we will provide them with a link</b>

Using <code>inspect</code> on the website, we found that the <code>href</code> attribute of the <code>a</code> tag holds the link

<b>Now we fetch the job URL, so our users can click the link and see the job description</b>

<h3>Fetching job URL</h3>

In [11]:
#Checking anchor tag

anchor_tag = single_card.a

#Fetching href info 

job_url = 'https://www.jobstreet.com.my'+anchor_tag['href'] #String concat needed for a proper link
print(job_url)

https://www.jobstreet.com.my/en/job/backend-developer-4751972?jobId=jobstreet-my-job-4751972&sectionRank=1&token=0~9c3b9ffd-2e17-4381-92f7-6eddf8ed619b&fr=SRP%20Job%20Listing
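As an alternative to string concatenation, the standard library's `urljoin` can combine the site address with the relative `href`. This is only a sketch: `absolute_job_url` is a hypothetical helper, and the shortened `href` below is a trimmed version of the one printed above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.jobstreet.com.my"


def absolute_job_url(href):
    """Turn a relative href into a full link; full links are returned unchanged."""
    return urljoin(BASE_URL, href)


print(absolute_job_url("/en/job/backend-developer-4751972"))
```

One benefit is that `urljoin` leaves an `href` alone if it is already a complete URL, whereas plain concatenation would produce a broken link.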


<h3>Fetching company name</h3>

Which company is offering this job?

On the website, <code>inspect</code> the company name and copy the class from its <code>span</code> tag

<code>sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc1 _18qlyvc8</code>

In [12]:
company_name = single_card.find('span','sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc1 _18qlyvc8').text
print(company_name)

THREE LOGIC CONCEPTS SDN BHD


<h3>Fetching job location</h3>

Where is this job located?

Same approach: <code>inspect</code> the location on the website and copy the class from its <code>span</code> tag

<code>sx2jih0 zcydq82q zcydq810 iwjz4h0</code>

In [13]:
job_location = single_card.find('span','sx2jih0 zcydq82q zcydq810 iwjz4h0').text
print(job_location)

Kuala Lumpur


<h3>Fetching salary information</h3>

This card includes <b>salary</b> info, so our program will fetch it.

<b>But..</b> if a card has <b>no information about salary</b>, we substitute the value <b>Not Available</b>

The process is the same as before. Let's find the <code>html tag</code> that is responsible for our <b>salary information</b>

---

There's a problem: the location and the salary each sit in a <code>span</code> tag with the same class name. How do we choose the correct one?

<b>Jobs without salary</b> are the exception, as their cards have no <code>span</code> tag for the salary in the <code>inspect</code> view

TL;DR 
>Salary (NO), location (YES) == 1 span tag<br>
>Salary (YES), location (YES) == 2 span tags (1 for location, 1 for salary)

---

Let's fetch the salary data (handling cards with and without the salary tag), using <code>try and except</code>


In [14]:
#Fetching salary info


job_salary = single_card.find_all('span','sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc3 _18qlyvc6')
try:
    if len(job_salary)>=2:
        job_salary = job_salary[1]
        salary = job_salary.text
    else:
        salary = 'Not Available'
except IndexError:
    salary = 'Not Available'
        

<h3>Using find_all to collect all the span tags</h3>

Reasoning behind <code>try & except</code>

Since some jobs might not provide salary info, we need to handle both cases.<br>
Jobs with <b>no salary info</b> will give a <code>len(job_salary)</code> of 1<br>
Jobs with <b>salary info</b> will give a <code>len(job_salary)</code> of 2

<h4>Try & Except</h4>

We only want to read the salary when it exists, and fall back to a placeholder when it does not<br>

<code>index[0] is location, index[1] is salary info</code>

Reading index [1] when <code>len(job_salary)</code> is less than 2 would raise an <b>IndexError</b><br>
The <code>if</code> check prevents that, and the <code>except</code> is a safety net in case the error is raised anyway.

Basically, <b>if len(job_salary) is 2 or more (what we want)</b> it will show the salary<br>Else, it will output Not Available (and this applies if the index is out of range too!)
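The length check described above can be sketched as a small standalone function. It works on plain strings rather than real tags, and `pick_salary` and `NOT_AVAILABLE` are hypothetical names used only for this illustration.

```python
NOT_AVAILABLE = "Not Available"


def pick_salary(span_texts):
    """Return the second span text (the salary) if present, else a placeholder."""
    # Index 0 is the location and index 1 is the salary, so two items are needed
    if len(span_texts) >= 2:
        return span_texts[1]
    return NOT_AVAILABLE


print(pick_salary(["Kuala Lumpur", "MYR 8.5K - 13K monthly"]))
print(pick_salary(["Kuala Lumpur"]))
print(pick_salary([]))
```

Because the length is checked before indexing, this version never raises an `IndexError`, so it needs no `try` block.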



<h4>Testing the code when salary is available</h4>

<b>Note: the listing may have been updated, so this cell may give a different result</b>

In [15]:
print(salary)

MYR 8.5K - 13K monthly


<h3>Fetching summary info</h3>

For this example we use the card at index [1], because at the time of writing that card had the summary info we need <br><br> <b>NOTE: this is based on the page at the time of writing this code</b>
    
The expected job name was <code>Senior Software Engineer (Java,Integration,Mobile App)</code>, but the listing order changes over time, so the output below differs

Since the job summary is listed in a 'ul' tag, we need to <code>inspect</code> the summary and find the class of the 'ul' tag

In [16]:
card = cards[1]
job_name = card.find('div','sx2jih0 _2j8fZ_0 sIMFL_0 _1JtWu_0').text
print(job_name)

Software Engineer (Senior/ Junior on Java, NodeJS, ReactJS multiple positions)


<h4>Scraping with 'ul' tag</h4>

In [17]:
job_summary = card.find('ul','sx2jih0 sx2jih3 h6p8rp0 h6p8rp5')
print(job_summary)

<ul class="sx2jih0 sx2jih3 h6p8rp0 h6p8rp5"></ul>


<h4>Scraping with 'li' tag</h4>

In this section, we will only extract the texts under the 'li' tag from the 'ul'


In [18]:
for eachLI in job_summary('li'):
    print(eachLI.text)
    

But here's another problem: we don't want to store the text of each <code>eachLI</code> on its own line; it is preferable to store the items separated by commas,<br>



<code>Exposure to Advanced,Flexible Working Hours, Near to Public</code>

<b>Why?</b> <br>Because we want to store it in a pandas dataframe

---



<h3>Fetching 'li' tag text and saving it in <code>summary</code></h3>

We build the summary text in a variable through iteration

How the code works

<code> summary = summary + tag_text </code>

<b>First iteration</b>
<ol>
    <li> 1st Exposure to Advanced Fintech Technologies and Skills</li>
    <li> tag_text = Exposure to Advanced Fintech Technologies and Skills</li>
    <li> summary = Exposure to Advanced Fintech Technologies and Skills</li>
        
</ol>

<b>Second iteration</b>
<ol>
    <li> 2nd Flexible Working Hours</li>
    <li> tag_text = Flexible Working Hours</li>
    <li> summary = Exposure to Advanced Fintech Technologies and Skills+ Flexible Working Hours</li>
        
</ol>

<b>Third iteration</b>
<ol>
    <li> 3rd Near to Public Transport Hub - KL Sentral</li>
    <li> tag_text = Near to Public Transport Hub - KL Sentral</li>
    <li> summary = Exposure to Advanced Fintech Technologies and Skills+ Flexible Working Hours+Near to Public Transport Hub - KL Sentral</li>
        
</ol>

<b>Output will be</b><br>
>Exposure to Advanced Fintech Technologies and SkillsFlexible Working HoursNear to Public Transport Hub - KL Sentral

But since the output is confusing (no separators), we will separate the items using a comma

<code>summary = summary + tag_text + ',' </code>

Then, instead of running together, the items will be separated with a comma

>Exposure to Advanced Fintech Technologies and Skills,Flexible Working Hours,Near to Public Transport Hub - KL Sentral,
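For comparison, the same comma-separated text can be produced with `str.join`, which puts the separator only between items and so leaves no trailing comma. This is a sketch of an alternative, using the three example items from the walkthrough above as plain strings.

```python
items = [
    "Exposure to Advanced Fintech Technologies and Skills",
    "Flexible Working Hours",
    "Near to Public Transport Hub - KL Sentral",
]

# join() inserts "," between items only, never after the last one
summary = ",".join(items)
print(summary)

# An empty list joins to an empty string, so "or" supplies the placeholder
empty_summary = ",".join([]) or "Not Available"
print(empty_summary)
```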



In [19]:
summary = ''
tag_text = ''

for eachLI in job_summary('li'):
    tag_text = eachLI.text
    summary = summary + tag_text + ','

print(summary)







---
<h4>See that last comma? Let's remove it</h4>

Recall, our output was

>Exposure to Advanced Fintech Technologies and Skills,Flexible Working Hours,Near to Public Transport Hub - KL Sentral<b>,</b>

Using <code>rstrip</code>

><code>rstrip(',')</code> removes every trailing comma from the end of the string

In [20]:
summary = summary.rstrip(',')
print(summary)
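A quick sketch of how `rstrip` behaves may help here. Its argument is treated as a set of characters, and all trailing characters from that set are removed, not just one. The sample strings are made up for illustration.

```python
text = "Flexible Working Hours,Near to Public Transport Hub,,,"

# Every trailing comma is stripped, not only the last one
stripped = text.rstrip(",")
print(stripped)

# A string with no trailing comma is returned unchanged
unchanged = "a,b".rstrip(",")
print(unchanged)

# With several characters, any trailing mix of them is removed
mixed = "abc,;,".rstrip(",;")
print(mixed)
```

Commas inside the string are never touched, because `rstrip` stops at the first character from the right that is not in the set.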




---

Some job postings <b>may not</b> include a job summary

So, we handle this in a similar manner (like the previous ones!)

Let's use <code>try & except</code> again

<h4>Putting job_summary inside <code>try except</code></h4>

Why <code>AttributeError</code>?

If the program cannot find a 'ul' tag with the <b>class name</b>, <code>find</code> returns <code>None</code>, and calling <code>find_all</code> on <code>None</code> raises an <code>AttributeError</code>, which is different from the earlier <code>IndexError</code>

In [21]:
try:
    summary = ''
    tag_text = ''
    job_summary = card.find('ul','sx2jih0 sx2jih3 h6p8rp0 h6p8rp5')
    # .find_all() on None raises AttributeError; calling None('li') would raise TypeError instead
    for eachLI in job_summary.find_all('li'):
        tag_text = eachLI.text
        summary = summary + tag_text + ','
    if ',' in summary:
        summary = summary.rstrip(',')
    else:
        summary = 'Not Available'

except AttributeError:
    summary = 'Not Available'
    

In [22]:
print(summary)

Not Available
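When `find` matches nothing it returns `None`, and the error that follows depends on how that `None` is used. The sketch below checks this with plain Python, no scraping involved; `missing` stands in for the result of a failed `find`, and `error_name` is a hypothetical helper.

```python
missing = None  # stands in for the result of card.find(...) when nothing matches


def error_name(action):
    """Run the action and report the name of the exception it raises, if any."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return "no error"


# Looking up an attribute on None raises AttributeError
attribute_case = error_name(lambda: missing.find_all("li"))
# Calling None as if it were a function raises TypeError
call_case = error_name(lambda: missing("li"))

print(attribute_case, call_case)
```

So an `except AttributeError` clause catches the `missing.find_all('li')` form, while the shorthand call form `missing('li')` would need `TypeError` to be caught as well.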


<h3>Checking that the code works</h3>

>Where summary info is <b>not available</b>

A <b>'ul'</b> with the specified class name may exist but contain no <b>'li'</b>

In that case the loop adds nothing and the summary stays empty

Originally we had <br><code>card = cards[0]

try:
    summary = ''
    tag_text = ''
    job_summary = card.find('ul','sx2jih0 sx2jih3 h6p8rp0 h6p8rp5')
    for eachLI in job_summary.find_all('li'):
        tag_text = eachLI.text
        summary = summary + tag_text + ','
    summary = summary.rstrip(',')
    
except AttributeError:
    summary = 'Not Available'</code>
    
For a card with an empty 'ul' this gives an empty output, so to fix that we need a <code>condition (if else)</code>

In [23]:
card = cards[0]

try:
    summary = ''
    tag_text = ''
    job_summary = card.find('ul','sx2jih0 sx2jih3 h6p8rp0 h6p8rp5')
    for eachLI in job_summary.find_all('li'):  # raises AttributeError if job_summary is None
        tag_text = eachLI.text
        summary = summary + tag_text + ','
    if ',' in summary:
        summary = summary.rstrip(',')
    else:
        summary = 'Not Available'
except AttributeError:
    summary = 'Not Available'
    
print(summary)

Small, Agile and Growing Team,Open-mindedness Culture,Young and experienced company


<h3>Checking again</h3>

>Where summary info <b>is available</b>

In [24]:
card = cards[3]

try:
    summary = ''
    tag_text = ''
    job_summary = card.find('ul','sx2jih0 sx2jih3 h6p8rp0 h6p8rp5')
    for eachLI in job_summary.find_all('li'):  # raises AttributeError if job_summary is None
        tag_text = eachLI.text
        summary = summary + tag_text + ','
    if ',' in summary:
        summary = summary.rstrip(',')
    else:
        summary = 'Not Available'
except AttributeError:
    summary = 'Not Available'
    
print(summary)

Award-winning Fintech company,Attractive remuneration & challenging work,Flexibility to work remotely or from home


<h3>Fetching job posted date info</h3>

When was the job posted?

The raw value looks like <code>2022-01-01T19:00:44.000Z</code>



In [25]:
card = cards[0]

time_tag = card.time
post_date = time_tag.get('datetime')
post_date

'2022-01-03T19:00:54.000Z'

<h3>Removing everything after T</h3>

But we only want the date <code>2022-01-01</code>

How do we remove the extra info (the time part)?

Using <code>str.split()</code><br>
>Splits a string into a list at a given separator

<a href="https://www.w3schools.com/python/ref_string_split.asp">str.split() doc</a>

<code>string.split(separator,maxsplit)</code>

Example

<code>txt = "apple#banana#cherry#orange"<br>x=txt.split("#",1)<br>print(x)</code>

Will output <code>['apple', 'banana#cherry#orange']</code>
    
<code>txt = "apple#banana#cherry#orange"<br>x=txt.split("#",2)<br>print(x)</code>

Will output <code>['apple', 'banana', 'cherry#orange']</code>

<b>Let's remove them!</b>

In [26]:
post_date = post_date.split('T')
post_date = post_date[0]
print(post_date)

2022-01-03
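Splitting on `'T'` works well for this format. As a sketch of an alternative, the same timestamp can be parsed with `datetime.strptime`, which also validates that the value really is a date. The format string below is an assumption that matches the sample value shown above.

```python
from datetime import datetime

raw = "2022-01-03T19:00:54.000Z"

# Approach used in this notebook: keep the part before the "T"
date_only = raw.split("T")[0]

# Alternative: parse the whole timestamp, then keep only the date
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
parsed_date = parsed.date().isoformat()

print(date_only)
print(parsed_date)
```

Parsing raises a `ValueError` on malformed input, which can be useful if the site ever changes its timestamp format.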


<h3>Fetching today's date</h3>

--AS OF WRITING DATE IS : 2022-01-03--

Later we can compare post_date with today's date

That way, our users can be told that the job was posted 2 days ago, 3 days ago, etc.

That info is going to be displayed in a pandas dataframe

In [27]:
today = datetime.today().strftime('%Y-%m-%d')
print(today)

2022-01-04
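The comparison between the posting date and today's date can be sketched with the standard library. `days_since` is a hypothetical helper, and it assumes both values are strings in the `YYYY-MM-DD` form produced above.

```python
from datetime import date


def days_since(post_date, today):
    """Number of whole days between two 'YYYY-MM-DD' strings."""
    # Subtracting two date objects gives a timedelta, whose .days is an int
    delta = date.fromisoformat(today) - date.fromisoformat(post_date)
    return delta.days


print(days_since("2022-01-03", "2022-01-04"))
print(days_since("2021-12-31", "2022-01-04"))
```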


<h2>Alright, let's combine them all together!</h2>

We are going to create a function that scrapes the job info from every card

In [28]:
def get_jobs(card):
    job_name = card.find('div','sx2jih0 _2j8fZ_0 sIMFL_0 _1JtWu_0').text
    anchor_tag = card.a
    job_url = 'https://www.jobstreet.com.my'+anchor_tag['href']
    company_name = card.find('span','sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc1 _18qlyvc8').text
    job_location = card.find('span','sx2jih0 zcydq82q zcydq810 iwjz4h0').text
    # Fetching job summary
    try:
        summary = ''
        tag_text = ''
        job_summary = card.find('ul','sx2jih0 sx2jih3 h6p8rp0 h6p8rp5')
        for eachLI in job_summary.find_all('li'):  # raises AttributeError if job_summary is None
            tag_text = eachLI.text
            summary = summary + tag_text + ','
        if ',' in summary:
            summary = summary.rstrip(',')
        else:
            summary = 'Not Available'
    except AttributeError:
        summary = 'Not Available'
    #Fetching job salary
    job_salary = card.find_all('span','sx2jih0 zcydq82q _18qlyvc0 _18qlyvcv _18qlyvc3 _18qlyvc6')
    try:
        if len(job_salary)>=2:
            job_salary = job_salary[1]
            salary = job_salary.text
        else:
            salary = 'Not Available'
    except IndexError:
        salary = 'Not Available'
    #Fetching job post date
    time_tag = card.time
    post_date = time_tag.get('datetime')
    post_date = post_date.split('T')
    post_date = post_date[0] # Keep only the date part (drop the time)
    today = datetime.today().strftime('%Y-%m-%d')
    
    
    job_info = (job_name,job_url,company_name,job_location,summary,
               salary,post_date,today)
    
    return job_info
    
    
    

<h3>Checking if the function works</h3>

We call the function for every job card

Each item of <code>cards</code> is held in a temporary variable <code>everyCard</code><br>The function's result is saved to <code>jobDetails</code><br>Then, we append it to <code>records</code>

So <code>get_jobs</code> runs once for every item in <code>cards</code>

In [29]:
records = []

for everyCard in cards:
    jobDetails = get_jobs(everyCard)
    records.append(jobDetails)

In [30]:
print(len(records))

# 30 records, the same count as print(len(cards))

30


<h4>Printing job details for first card</h4>

In [31]:
print(records[0])

('BackEnd Developer', 'https://www.jobstreet.com.my/en/job/backend-developer-4751972?jobId=jobstreet-my-job-4751972&sectionRank=1&token=0~9c3b9ffd-2e17-4381-92f7-6eddf8ed619b&fr=SRP%20Job%20Listing', 'THREE LOGIC CONCEPTS SDN BHD', 'Kuala Lumpur', 'Small, Agile and Growing Team,Open-mindedness Culture,Young and experienced company', 'MYR\xa08.5K - 13K monthly', '2022-01-03', '2022-01-04')
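The salary field in the tuple above contains `\xa0`, a non-breaking space that the page uses between the currency and the amount. A minimal sketch for normalising it to a regular space follows; `clean_text` is a hypothetical helper, not part of the scraper above.

```python
def clean_text(text):
    """Replace non-breaking spaces and collapse repeated whitespace."""
    # "\xa0" is the non-breaking space character (U+00A0)
    text = text.replace("\xa0", " ")
    # split() then join() collapses any run of whitespace into one space
    return " ".join(text.split())


print(clean_text("MYR\xa08.5K - 13K monthly"))
```

Cleaning the text this way makes later filtering or comparison on the salary column more predictable.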


<h4>Printing job details for second card</h4>

In [32]:
print(records[1])

('Software Engineer (Senior/ Junior on Java, NodeJS, ReactJS multiple positions)', 'https://www.jobstreet.com.my/en/job/software-engineer-senior-junior-on-java-nodejs-reactjs-multiple-positions-4773963?jobId=jobstreet-my-job-4773963&sectionRank=2&token=0~9c3b9ffd-2e17-4381-92f7-6eddf8ed619b&fr=SRP%20Job%20Listing', 'MHC Asia Group', 'Kuala Lumpur', 'Not Available', 'MYR\xa03.5K - 7K monthly', '2022-01-01', '2022-01-04')


<h4>Printing job details for third card</h4>

In [33]:
print(records[2])

('Senior Software Engineer (Java, Integration, Mobile App)', 'https://www.jobstreet.com.my/en/job/senior-software-engineer-java-integration-mobile-app-4767131?jobId=jobstreet-my-job-4767131&sectionRank=3&token=0~9c3b9ffd-2e17-4381-92f7-6eddf8ed619b&fr=SRP%20Job%20Listing', 'SUNLINE TECHNOLOGY (MALAYSIA) SDN BHD', 'Kuala Lumpur', 'Exposure to Advanced Fintech Technologies and Skills,Flexible Working Hours,Near to Public Transport Hub - KL Sentral', 'Not Available', '2021-12-31', '2022-01-04')
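Since the goal is to show the results in a pandas dataframe, here is a sketch of how tuples like the ones above could be loaded. The column names are assumptions chosen to mirror the variables in `get_jobs`, and the two sample rows are shortened versions of the printed records.

```python
import pandas as pd

# Assumed column names, in the same order as the tuple returned by get_jobs
COLUMNS = ["job_name", "job_url", "company_name", "job_location",
           "summary", "salary", "post_date", "today"]

# Shortened sample rows for illustration; real rows come from the records list
sample_records = [
    ("BackEnd Developer",
     "https://www.jobstreet.com.my/en/job/backend-developer-4751972",
     "THREE LOGIC CONCEPTS SDN BHD", "Kuala Lumpur",
     "Small, Agile and Growing Team", "MYR 8.5K - 13K monthly",
     "2022-01-03", "2022-01-04"),
    ("Software Engineer",
     "https://www.jobstreet.com.my/en/job/software-engineer-4773963",
     "MHC Asia Group", "Kuala Lumpur",
     "Not Available", "MYR 3.5K - 7K monthly",
     "2022-01-01", "2022-01-04"),
]

df = pd.DataFrame(sample_records, columns=COLUMNS)

# Days between the posting date and the scrape date, as whole numbers
df["days_ago"] = (pd.to_datetime(df["today"]) - pd.to_datetime(df["post_date"])).dt.days

print(df[["job_name", "post_date", "days_ago"]])
```

With the real data, `sample_records` would simply be replaced by `records`.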


<h3><center>Made with Love by Fikri Fansuri</center></h3>