## Demo: Web scraping with BeautifulSoup

This demo shows how to use BeautifulSoup to scrape job listings from Indeed.

In [3]:
## Import the necessary packages
from bs4 import BeautifulSoup
import urllib
import re
import pandas as pd

### 1. Get the links to the job postings first

We use Indeed's mobile site because its HTML is simpler.

In [4]:
from urllib.request import urlopen
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')

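# Job links on the results page are <a> tags carrying rel="nofollow"
all_matches = soup.find_all('a', attrs={'rel': ['nofollow']})
for i in all_matches:
    print(i['href'])        # relative link to the posting
    print(type(i['href']))
    print("https://www.indeed.com/m/" + i['href'])  # absolute link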

viewjob?jk=46caf455b09ff764
<class 'str'>
https://www.indeed.com/m/viewjob?jk=46caf455b09ff764
viewjob?jk=6451fc0e875f748b
<class 'str'>
https://www.indeed.com/m/viewjob?jk=6451fc0e875f748b
viewjob?jk=a5858c00f357dcc0
<class 'str'>
https://www.indeed.com/m/viewjob?jk=a5858c00f357dcc0
viewjob?jk=7b8f1e2c8b577bf6
<class 'str'>
https://www.indeed.com/m/viewjob?jk=7b8f1e2c8b577bf6
viewjob?jk=321102a554ae64aa
<class 'str'>
https://www.indeed.com/m/viewjob?jk=321102a554ae64aa
viewjob?jk=16b629bef547ac43
<class 'str'>
https://www.indeed.com/m/viewjob?jk=16b629bef547ac43
viewjob?jk=4518485c49710109
<class 'str'>
https://www.indeed.com/m/viewjob?jk=4518485c49710109
viewjob?jk=1369bc7dfa807cb5
<class 'str'>
https://www.indeed.com/m/viewjob?jk=1369bc7dfa807cb5
viewjob?jk=6b6eaa13bfbb8270
<class 'str'>
https://www.indeed.com/m/viewjob?jk=6b6eaa13bfbb8270
viewjob?jk=1718af7fd649a97a
<class 'str'>
https://www.indeed.com/m/viewjob?jk=1718af7fd649a97a
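
String concatenation works above because every `href` is relative to `/m/`. As a minimal sketch of a more general approach, the standard library's `urljoin` resolves a link against the page it was found on. The helper name `absolute_job_url` is hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

# Search-results URL used in the demo above.
SEARCH_URL = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"


def absolute_job_url(href, base=SEARCH_URL):
    """Hypothetical helper: resolve an href against the page it came from."""
    return urljoin(base, href)


print(absolute_job_url("viewjob?jk=46caf455b09ff764"))
# https://www.indeed.com/m/viewjob?jk=46caf455b09ff764
```

An `href` that is already absolute is returned unchanged, so the same helper covers both cases.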


### 2. Find the title, company, location and detailed job description for each job

Let's first see a brief example:

In [32]:
test_html= \
'''
<html>
	<body>
		<p>
			<b>
				<font size="+1">Analyst - Data Science</font>
			</b>
			<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span>
		</p>
	</body>
</html>
'''


In [33]:
bs = BeautifulSoup(test_html,'lxml')

In [34]:
print(bs.body.p.b.font.text)

Analyst - Data Science


In [35]:
print(bs.body.p.text)



Analyst - Data Science

The Boston Consulting Group - Los Angeles, CA



In [19]:
print(bs.body.p.span.text)

Los Angeles, CA


#### Find title, company, location and job description for one position

In [67]:
title = []
company = []
location = []
jd = []
for each in all_matches:
    jd_url = 'https://www.indeed.com/m/' + each['href']
    jd_page = urlopen(jd_url)
    jd_soup = BeautifulSoup(jd_page, 'lxml')
    jd_desc = jd_soup.find_all('div', attrs={'id': ['desc']})  ## matches <div id="desc">...</div>
    break  ## stop after one posting so we can inspect it in the cells below
## Remove the break and uncomment these lines (indented inside the loop) to collect every posting:
#     title.append(jd_soup.body.p.b.font.text)
#     company.append(jd_desc[0].span.text)
#     location.append(jd_soup.body.p.span.text)
#     jd.append(jd_desc[0].text)

In [68]:
## Job Description
print(jd_desc[0].text)

What you’ll be doing...
We are looking for a Technical Business Intelligence Manager to join the team to help drive a data-focused product culture for Fios.
As a data driven product organization, our mission is to turn terabytes of valuable data into insights and get a deep understanding of video and viewers to impact the strategy and direction of IPTV and video experiences. You will study user behavior, strategic initiatives, markets, content, and new features and bring data and insights into every decision we make. You will find patterns but also assume that our challenges are unique and fearlessly question the product hypothesis through data insights. Above all, your work will impact the way the world experiences TV.
What you’ll do: Perform analyses on large sets of data to extract actionable insights that will help drive decisions across the business Communicate data-driven insights and recommendations to key stakeholders You will develop analytic methods, build models, and define 

In [64]:
## Job Title 
print(jd_soup.body.p.b.font.text)

Business Intelligence Manager - Data Analytics


In [55]:
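## Company Name
## Option 1: first <span> inside the description (on some postings this holds the posting age instead)
print(jd_desc[0].span.text)
## Option 2: the text node just before the location <span>, which reads "<company> - "
print(jd_soup.body.p.span.previous_sibling.split('-')[0][1:])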

30+ days ago
Fuel Cycle 
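
The output above suggests that the first `<span>` of the description is not always the company name. As a hedged sketch, the snippet below pulls all three header fields from markup shaped like the `test_html` example, reading the company from the text node in front of the location span. The helper name `parse_header` is hypothetical, and the `class="location"` attribute is assumed to be present as it is in `test_html`; real pages may differ.

```python
from bs4 import BeautifulSoup

# Same structure as the test_html example, written on one line.
SAMPLE_HTML = (
    '<html><body><p><b><font size="+1">Analyst - Data Science</font></b>'
    '<br>The Boston Consulting Group - '
    '<span class="location">Los Angeles, CA</span></p></body></html>'
)


def parse_header(soup):
    """Hypothetical helper: extract title, company and location from a job header."""
    location_tag = soup.find('span', class_='location')
    # The text node before the location span looks like "<company> - ".
    raw = location_tag.previous_sibling or ''
    # Strip only the trailing separator so hyphens inside a name survive.
    company = str(raw).strip().rstrip('-').strip()
    return {
        'title': soup.find('font').get_text(strip=True),
        'company': company,
        'location': location_tag.get_text(strip=True),
    }


# html.parser ships with Python; 'lxml', as used in the demo, also works.
info = parse_header(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(info)
```

Stripping the trailing separator, rather than splitting on every `-`, keeps company names that themselves contain a hyphen intact.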


In [69]:
title

['Data Scientist',
 'Data Scientist, Revenue Analytics',
 'Data Engineer / Scientist',
 'Data Entry & Analysis Clerk (entry level)',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist/Quantitative Analyst',
 'Data Entry Operator',
 'Analytics Expert, Team Manager - Automation & Programming',
 'Business Intelligence Manager - Data Analytics']

#### Save the data into a DataFrame

In [71]:
job = {'title': title,
       'company': company,
       'location': location,
       'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [72]:
df

| | Job Description | company | location | title |
|---|---|---|---|---|
| 0 | Interested in working in a fast-paced start-up... | 30+ days ago | Los Angeles, CA | Data Scientist |
| 1 | Snap Inc. is a camera company. We believe that... | Snap Inc. | Los Angeles, CA | Data Scientist, Revenue Analytics |
| 2 | High profile VC Backed Startup seeks Data Engi... | HireClout | Santa Monica, CA | Data Engineer / Scientist |
| 3 | MPCS is a national transportation compliance c... | 1 day ago | Sylmar, CA 91342 | Data Entry & Analysis Clerk (entry level) |
| 4 | MULTIPLE POSITIONS AVAILABLE\n\nDUTIES\n\nThe ... | L.A. Care Health Plan | Los Angeles, CA 90017 | Data Scientist |
| 5 | JOANY is on a mission to make buying health in... | Joany | Los Angeles, CA 90017 | Data Scientist |
| 6 | Data Scientist/Quantitative Analyst/R Programm... | 4 days ago | Los Angeles, CA | Data Scientist/Quantitative Analyst |
| 7 | National Genetics Institute (NGI) is part of t... | LabCorp | Los Angeles, CA | Data Entry Operator |
| 8 | PRACTICE AREA:\n\n\nBCG GAMMA delivers powerfu... | The Boston Consulting Group | Los Angeles, CA | Analytics Expert, Team Manager - Automation & ... |
| 9 | What you’ll be doing...\nWe are looking for a ... | Verizon | Los Angeles, CA 90094 | Business Intelligence Manager - Data Analytics |


If we remove the `break` from the loop above and uncomment the `append` lines, the loop collects the information for every job on the page.
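
In the table above, a few `company` cells hold a posting age such as `30+ days ago` rather than a company name. A minimal sketch for flagging those rows with a regular expression follows; the helper name `looks_like_posting_age` is hypothetical, and the pattern only covers the "N days ago" wording seen in this output, so other phrasings would need their own patterns.

```python
import re

# Matches strings such as "1 day ago", "4 days ago" or "30+ days ago".
POSTING_AGE = re.compile(r'^\d+\+?\s+days?\s+ago$', re.IGNORECASE)


def looks_like_posting_age(text):
    """Hypothetical helper: True when the text is a posting age, not a company."""
    return bool(POSTING_AGE.match(text.strip()))


# A few values taken from the company column above.
companies = ['30+ days ago', 'Snap Inc.', 'HireClout', '1 day ago', 'Joany', '4 days ago']
suspect = [c for c in companies if looks_like_posting_age(c)]
print(suspect)
# ['30+ days ago', '1 day ago', '4 days ago']
```

Rows flagged this way are candidates for re-extraction from the text node before the location span.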

### 3. Change pages automatically

In [73]:
title = []
company = []
location = []
jd = []
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
for i in range(2):  ## crawl two result pages
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    all_matches = soup.find_all(attrs={'rel': ['nofollow']})
    for each in all_matches:
        jd_url = 'https://www.indeed.com/m/' + each['href']
        jd_page = urlopen(jd_url)
        jd_soup = BeautifulSoup(jd_page, 'lxml')
        jd_desc = jd_soup.find_all(attrs={'id': ['desc']})
        title.append(jd_soup.body.p.b.font.text)
        ## Note: on some postings the first <span> holds the posting age, not the company
        company.append(jd_desc[0].span.text)
        location.append(jd_soup.body.p.span.text)
        jd.append(jd_desc[0].text)

    ## Change the pages to Next Page
    url_all = soup.find_all(attrs={'rel': ['next']})
    if not url_all:  ## no "next" link, so this was the last page
        break
    url = 'https://www.indeed.com/m/' + str(url_all[0]['href'])
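
The page-changing step can also be isolated into a small function that returns `None` when no `rel="next"` link exists, which gives the crawl loop a clean stopping condition. This is a sketch under assumptions: the helper name `next_page_url` is hypothetical, and the two HTML snippets are made up to stand in for a results page with and without a next link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(soup, current_url):
    """Hypothetical helper: absolute URL of the rel="next" link, or None."""
    link = soup.find(attrs={'rel': 'next'})
    if link is None or not link.get('href'):
        return None
    return urljoin(current_url, link['href'])


# Made-up snippets standing in for real result pages.
current = "https://www.indeed.com/m/jobs?q=data+scientist"
with_next = BeautifulSoup(
    '<a rel="next" href="jobs?q=data+scientist&amp;start=10">Next</a>', 'html.parser')
without_next = BeautifulSoup('<p>No more results</p>', 'html.parser')

print(next_page_url(with_next, current))
print(next_page_url(without_next, current))
```

A crawl loop can then run `while url is not None:` and reassign `url = next_page_url(soup, url)` at the end of each pass, instead of relying on a fixed page count.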


In [74]:
job = {'title': title,
       'company': company,
       'location': location,
       'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [75]:
df

| | Job Description | company | location | title |
|---|---|---|---|---|
| 0 | Interested in working in a fast-paced start-up... | 30+ days ago | Los Angeles, CA | Data Scientist |
| 1 | MPCS is a national transportation compliance c... | 1 day ago | Sylmar, CA 91342 | Data Entry & Analysis Clerk (entry level) |
| 2 | JOANY is on a mission to make buying health in... | Joany | Los Angeles, CA 90017 | Data Scientist |
| 3 | In addition to the responsibilities listed bel... | Kaiser Permanente | Pasadena, CA | Data Scientist |
| 4 | MULTIPLE POSITIONS AVAILABLE\n\nDUTIES\n\nThe ... | L.A. Care Health Plan | Los Angeles, CA 90017 | Data Scientist |
| 5 | OPEN RECRUITMENT\n\nMANAGEMENT TEAM\n\n(CURREN... | Long Beach City College | Long Beach, CA | Data Scientist |
| 6 | Job Description\n\nThe Role\nThis is an execut... | INgrooves Music Group | Los Angeles, CA | Data Scientist |
| 7 | National Genetics Institute (NGI) is part of t... | LabCorp | Los Angeles, CA | Data Entry Operator |
| 8 | Clicktripz is looking for a skilled Data Scien... | Clicktripz | Manhattan Beach, CA | Data Scientist |
| 9 | What you’ll be doing...\nWe are looking for a ... | Verizon | Los Angeles, CA 90094 | Business Intelligence Manager - Data Analytics |
