# Webscraping using the requests library
## Import necessary libraries


   - In this notebook, I use the `requests` and `BeautifulSoup` libraries to scrape job postings from the Indeed website.
   - Initially, data was also scraped from Glassdoor and Monster, but the job aggregator sites overlapped heavily. Since Indeed is the largest of them, scraping Indeed alone was sufficient for now.

In [302]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

In [84]:
url = 'https://www.indeed.com/jobs?q=Data+%26+Database&l=san+francisco%2C+ca&sort=date&start=80' # This is a string of a URL that we want to grab info from
res = requests.get(url)
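
The query string in the URL above can also be built programmatically instead of being typed by hand. The following is a minimal sketch using only the standard library; the helper name `build_search_url` and the constant `BASE_URL` are assumptions made for illustration and are not part of the original notebook.

```python
from urllib.parse import urlencode

# Assumed base endpoint, taken from the URL used in this notebook.
BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(keywords, location, start=0):
    """Build a search URL sorted by date (hypothetical helper).

    urlencode() percent-encodes special characters such as '&' and ','
    and turns spaces into '+'.
    """
    params = {"q": keywords, "l": location, "sort": "date"}
    if start:
        # Result offset; omitted for the first page.
        params["start"] = start
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("Data & Database", "san francisco, ca", start=80))
```

Letting `urlencode` handle the escaping avoids mistakes with characters such as `&`, which would otherwise be read as a parameter separator.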

### Get status codes

In [85]:
res.status_code

200

In [4]:
res.content

b'<!DOCTYPE html>\n<html lang="en" dir="ltr">\n<head>\n<meta http-equiv="content-type" content="text/html;charset=UTF-8">\n<script type="text/javascript" src="//d3fw5vlhllyvee.cloudfront.net/s/ffe72ff/en_US.js"></script>\n<link href="//d3fw5vlhllyvee.cloudfront.net/s/105b986/jobsearch_all.css" rel="stylesheet" type="text/css">\n<link rel="alternate" type="application/rss+xml" title="Data &amp; Database Jobs, Employment in San Francisco, CA" href="https://rss.indeed.com/rss?q=Data+%26+Database&l=san+francisco%2C+ca&sort=date">\n<link rel="alternate" media="only screen and (max-width: 640px)" href="/m/jobs?q=Data+%26+Database&l=san+francisco%2C+ca&sort=date">\n\n<script type="text/javascript">\n\nif (typeof window[\'closureReadyCallbacks\'] == \'undefined\') {\nwindow[\'closureReadyCallbacks\'] = [];\n}\n\nfunction call_when_jsall_loaded(cb) {\nif (window[\'closureReady\']) {\ncb();\n} else {\nwindow[\'closureReadyCallbacks\'].push(cb);\n}\n}\n</script>\n<meta name="ppstriptst" content="

## Creating a `BeautifulSoup` object
 - Create a `soup` object, then use the `find()` method to determine which tags hold the information you want to extract from the site.


### `soup.find()`

In [91]:
soup = BeautifulSoup(res.content, 'lxml')

In [7]:
h1 = soup.find('h1')

In [8]:
type(h1)

bs4.element.Tag

In [9]:
h1.text

'\n            Data & Database jobs in San Francisco, CA'

### `soup.find_all()`

- To find out which `div` tags we want, use `find_all()` to list the attributes (including classes) of every `div` on the page.
- Using the browser's Inspect option, you can easily narrow down which `div` tags contain the desired text or block of information.
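
As a small offline illustration of the difference between `find()` and `find_all()`, the sketch below parses a made-up HTML snippet (the markup and class names are invented for this example) and counts how often each class appears on a `div`. It uses the built-in `html.parser` backend rather than `lxml` so that it runs without extra dependencies.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Toy HTML for illustration only; not taken from the real page.
html = """
<div class="header"><h1>Jobs</h1></div>
<div class="card result"><h2 class="title">Data Analyst</h2></div>
<div class="card result"><h2 class="title">Data Engineer</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches).
first_h2 = soup.find("h2")

# find_all() returns a list of every matching tag.
all_h2 = soup.find_all("h2")

# Count how often each class appears on a div; classes that repeat
# many times are good candidates for the repeated "card" blocks.
class_counts = Counter(
    cls for div in soup.find_all("div") for cls in div.get("class", [])
)

print(first_h2.text)
print([tag.text for tag in all_h2])
print(class_counts)
```

Classes that occur once per listing tend to mark the repeated blocks worth inspecting more closely.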

In [21]:
h2_tags = soup.find_all('h2')

In [37]:
[tag.text.strip() for tag in h2_tags]

['Principal Business Data Analyst\nnew',
 'Lead Big Data Architect ( SME ) & Training lead\nnew',
 'Data Scientist\nnew',
 'Graduate Business Analyst\nnew',
 'Data Engineer\nnew',
 'Senior Data Analyst\nnew',
 'Software Engineer II: Data Pipeline\nnew',
 'Informatica Developer\nnew',
 'Warranty Business Analyst\nnew',
 'Sr. SQL Server DBA (w2 only)\nnew',
 'Senior Statistical Programmer\nnew',
 'Data Engineering Manager\nnew',
 'Senior Data Engineer\nnew',
 'Lending Technology Documents Business Analyst\nnew',
 'Data Analyst\nnew']

In [92]:
div_tags = soup.find_all('div')
[tag.attrs for tag in div_tags]

[{'id': 'accessibilityBanner', 'role': 'navigation', 'aria-label': 'skip'},
 {'class': ['main'], 'data-gnav-region': 'Main', 'data-reactroot': ''},
 {'class': ['gnav-DesktopHeaderContainer']},
 {'class': ['gnav-DesktopHeader', 'is-small-header', 'with-skip-nav']},
 {'class': ['gnav-DesktopHeader-logo']},
 {'class': ['gnav-Logo-icon'], 'aria-hidden': 'true'},
 {'class': ['gnav-PageLinks']},
 {'class': ['gnav-DesktopHeader-secondaryLinks']},
 {'class': ['gnav-ProfileNavLinks']},
 {'class': ['gnav-ResumeLink']},
 {'class': ['gnav-LoggedOutAccountLink'], 'data-gnav-element-name': 'SignIn'},
 {'class': ['gnav-VariantToggleLink-forEmployers']},
 {'class': ['gnav-MobileHeaderContainer']},
 {'class': ['gnav-MobileHeader', 'is-small-header'],
  'data-gnav-region': 'Main'},
 {'class': ['gnav-MobileHeader-logo']},
 {'class': ['gnav-Logo-icon'], 'aria-hidden': 'true'},
 {'class': ['gnav-MobileHeader-links']},
 {'class': ['gnav-LoggedOutAccountLink'], 'data-gnav-element-name': 'SignIn'},
 {'class':

In [93]:
div_class_sjcl = soup.find('div', {'class': 'jobsearch-SerpJobCard'})
div_class_sjcl

<div class="jobsearch-SerpJobCard unifiedRow row result" data-ci="105245054" data-empn="8690912762161442" data-jk="9de45bbc5df33830" data-tu="https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=521&amp;rx_c=&amp;rx_campaign=indeed16&amp;rx_group=100137&amp;rx_source=Indeed&amp;job=RS2-160649028&amp;rx_r=none&amp;rx_ts=20210218T234812Z&amp;rx_pre=1&amp;indeed=sp" id="pj_9de45bbc5df33830">
<style>
.jobcard_logo{margin:6px 0}.jobcard_logo img{width:auto;max-width:80px;max-height:30px}.jasxrefreshcombotst .jobcard_logo img{max-height:2rem;max-width:100%}
</style>
<h2 class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0CpFJQzrgRR8WqXWK1qKKEqALWJw739KlKqr2H-MSI4eh4ZOxqVaUrhNSyjVEAq5t5jALyjjh_OWyJDl83j5GnWTU7LTov5wyXu9VP1MwBaO1MR_FJD5HBuL4WLDLhm_4Bqkr0N_FHQNUsVFrB9JQcJrBA3b8fI2IeZIDwvEsTdQKkBTXMoFIsvndoqcXTdvd5qKpCgzkphzlNqMKhBKXD6dtZoHF8CFthOMA2eve0Tu7K7giJaY5mOy2bWs2aOH0L2oXlk90IkBB7CgDMMPZ6l6k14

- The `div` with `'class': 'jobsearch-SerpJobCard'` appears to contain the smaller blocks of information we want.
- You can create a separate soup object for this class to get information from each job card.

## Extract relevant information
- Once you have narrowed down which class of block to look into, you can work out which sub-classes to extract from the job card.
- In this case, the sub-classes are named intuitively, so finding the title, company, location, and summary is straightforward.
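
Not every job card carries every field (a salary or rating may be absent), and `find()` returns `None` when nothing matches, so calling `.text` on the result raises `AttributeError`. One way to handle this is a small helper that returns `None` for missing tags. This is a sketch only: `text_or_none` is a hypothetical name, and the card markup and company name below are invented for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching child tag, or None.

    Hypothetical helper: avoids AttributeError when find() returns None.
    """
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


# Toy job card for illustration only; it has no salary element.
card_html = """
<div class="jobsearch-SerpJobCard">
  <h2 class="title">Data Analyst</h2>
  <span class="company">Acme</span>
</div>
"""
card = BeautifulSoup(card_html, "html.parser").find(
    "div", class_="jobsearch-SerpJobCard"
)

print(text_or_none(card, "h2", "title"))
print(text_or_none(card, "span", "company"))
print(text_or_none(card, "div", "salarySnippet"))
```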

In [89]:
title = div_class_sjcl.find('h2', {'class': 'title'})
title.text.strip().replace('\n', '_')

'Principal Business Data Analyst_new'

In [94]:
company = div_class_sjcl.find('span', {'class': 'company'})
company.text.strip()

'CyberCoders'

In [53]:
ratings = div_class_sjcl.find('span', {'class': 'ratingsContent'})
ratings.text.strip()

'4.2'

In [58]:
location = div_class_sjcl.find('div', {'class': 'location'})
location.text.strip()

'San Francisco, CA 94102 (Downtown area)'

In [95]:
summary = div_class_sjcl.find('div', {'class': 'summary'})
summary.text.strip()

'In addition to rolling up your sleeves and helping to guide the data engineers in a hands-on capacity, you will actually influence the tech direction for the…'

In [96]:
salary = div_class_sjcl.find('div', {'class': 'salarySnippet'})
salary.text.strip()

'$140,000 - $180,000 a year'

### Get the detailed job description using `href`
- From the main search page, however, the job description is not fully displayed.
- The summary section is limited to a couple of sentences, so you need to follow the link to get the full description.
- For each individual page, you can create a new soup object and scrape the full job description.
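
The `href` values on the search page are relative paths, so they need to be combined with the site's base URL before they can be requested. A minimal sketch using the standard library's `urljoin` follows; the helper name `absolute_url` is an assumption for illustration, and the example path is only the visible beginning of an `href` shown earlier, not a complete link.

```python
from urllib.parse import urljoin

# Assumed base URL for resolving relative links.
BASE = "https://www.indeed.com"


def absolute_url(href, base=BASE):
    """Resolve a possibly relative href against the base URL (hypothetical helper)."""
    return urljoin(base, href)


# A relative path is prefixed with the base URL.
print(absolute_url("/pagead/clk?mo=r"))

# An href that is already absolute is returned unchanged.
print(absolute_url("https://example.com/job/1"))
```

Compared with plain string concatenation, `urljoin` also copes with links that are already absolute.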

In [62]:
job_page = div_class_sjcl.find('a').attrs['href']

In [64]:
job_res = requests.get("http://www.indeed.com" + job_page)

In [65]:
job_soup = BeautifulSoup(job_res.content, 'lxml')

In [76]:
job_desc = job_soup.find('div', {'class':'jobsearch-jobDescriptionText'})
job_desc.text.strip().replace('\n', ' ')

"Overview Come join Intuit as our Data Capabilities Leader. What you'll bring7-12 years of experience working in product, marketing or other related analytics fields.Excels at organizing and tracking work, and mobilizing others to contribute to a shared visionAbility to tell stories with data, educates effectively, and instills confidence in recommendations, motivating others to act on themUnderstanding of a variety of analytics tools and technologies such as Segment and AmplitudeAbility to manage multiple projects simultaneously to meet objectives and deadlines.Outstanding communications skills with both technical and non-technical colleagues.Strong organizational skills, time management, portfolio prioritization experience, and accountability required.Attention to detail and bias to action: pursue data quality, troubleshoot data validation, and see issues to resolution How you will leadLead the strategy and vision around the technical work needed on data initiativesForm a point of vi

### Put together and test if it works for other job postings

- To make sure the classes and subclasses are not limited to the one example above, I put the steps together and tested them on another job posting.

In [81]:
jobcards = soup.find_all('div', {'class': 'jobsearch-SerpJobCard'})
for jobcard in jobcards:
    print(jobcard)


<div class="jobsearch-SerpJobCard unifiedRow row result" data-ci="327811381" data-empn="3018725801175434" data-jk="8c0932ee405ce4e0" data-tu="https://analytics.prng.co/i?e=se&amp;aid=indeed-firstclick&amp;p=web&amp;tv=no-js-0.1.0&amp;tna=indeed-firstclick-tracker&amp;se_ca=click&amp;se_ac=first-click&amp;se_la=jdv7vsb&amp;se_va=0.00&amp;indeed=sp" id="pj_8c0932ee405ce4e0">
<style>
.jobcard_logo{margin:6px 0}.jobcard_logo img{width:auto;max-width:80px;max-height:30px}.jasxrefreshcombotst .jobcard_logo img{max-height:2rem;max-width:100%}
</style>
<h2 class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0ClvsRMZZYbWkncYKklQZhomgGvtzxN2NDL0Yep_oEtFRBSc398mSurUi9Mojk8pioLkD7-Zmd2-h3tMvU6wmDIM30sR5rTcVlpZ_fpyNY9mCGbfuxgCFWfwIGBtfL4hEIj79CikPuVi9CjiePfaXArQ4BJnaAFK_ycxF1m_7BdNU1pCCTY0gM-xyFxTwJOf17a3baB6ZnqI5IMOyJsTiVQQ4q0sYbpchcgqhR-XegJPu8ztFaOdFyD968gyOYFbc56AhSc-yemTbaRA2wzRASxlirmj1lKIFCmpD12owebrRfON-pRUUshsMuXstMyOOk-vrQ3AZ4

In [172]:
job_page = jobcard.find('a').attrs['href']
job_res = requests.get("http://www.indeed.com" + job_page)
job_soup = BeautifulSoup(job_res.content, 'lxml')
job_soup.find('div', {'id':'jobDescriptionText'}).text.strip().replace('\n', ' ')


"Bioinformatics Programmer III involves developing and utilizing computational tools and systems to analyze and interpret biological or other research data. The person in this role utilizes and develops algorithms, computational techniques, and statistical methodologies; helps in the design of new experiments; implements end-user needs in database searching and integration; maintains the computational infrastructure and tracks the flow of samples and information for large-scale studies; and provides web-based bioinformatics and access to public and proprietary databases.The primary role of this position is to use skills as a seasoned, experienced bioinformatics programming professional with a broad understanding of computational algorithms and systems to apply computational and statistical methods to analyze and interpret data obtained from the next generation of functional genomics screens. This Bioinformatics Programmer III will demonstrate good judgment in selecting methods and tech

## `def indeed_scrape`

- Indeed is very sensitive to web crawling and may block your IP address after roughly 10 pages.
- For this initial scrape, I was able to request 100 pages (the results ended after 76 loops), but for routine scraping, 10 pages should be enough to fetch new postings.
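
One way to stay polite and react to blocking is to pause between requests and stop as soon as a non-200 response appears. The sketch below illustrates that idea and is not the approach used in the function that follows, which skips failed pages and keeps going. The name `fetch_until_blocked` and the injected `fetch` callable (assumed to return a `(status_code, body)` pair, for example a thin wrapper around `requests.get`) are assumptions made so the logic can be tried without network access.

```python
import time


def fetch_until_blocked(fetch, urls, delay=5.0, sleep=time.sleep):
    """Fetch URLs in order and stop at the first non-200 response.

    fetch: hypothetical callable taking a URL and returning (status_code, body).
    delay: seconds to wait after each successful request.
    sleep: injectable so the pause can be disabled in tests.
    """
    pages = []
    for url in urls:
        status, body = fetch(url)
        if status != 200:
            # Likely blocked or out of results; stop instead of retrying.
            break
        pages.append(body)
        sleep(delay)
    return pages


# Usage with a fake fetcher that starts failing on the third URL.
def fake_fetch(url):
    return (429, "") if url == "page3" else (200, f"html for {url}")


result = fetch_until_blocked(
    fake_fetch, ["page1", "page2", "page3", "page4"], delay=0, sleep=lambda s: None
)
print(result)
```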

In [240]:
def indeed_scrape(keywords, location, maxpage):
    keywords = keywords.replace(' ', '+')
    location = location.replace(' ', '+')
    url = 'https://www.indeed.com/jobs?q='+keywords+'&l='+location+'&sort=date'
    postings = []

    for i in range(0, maxpage):
        if (i == 0):
            page_append = ''
        else:
            page_append = "&start=" + str(i * 10)

        # get page
        res = requests.get(url + page_append)
        print(f'res {i+1} code: ', res.status_code)

        if res.status_code != 200:
            pass

        else:
            soup = BeautifulSoup(res.content, 'lxml')
            jobcards = soup.find_all('div', {'class': 'jobsearch-SerpJobCard'})
            for jobcard in jobcards:
                posting = {}
                posting['title'] = jobcard.find('h2', {'class': 'title'}).text.strip().replace('\n', '_')
                posting['company'] = jobcard.find('span', {'class': 'company' }).text.strip().replace('\n', '')       
                posting['location'] = jobcard.find(['div', 'span'], {'class': 'location accessible-contrast-color-location'}).text.strip()
                posting['summary'] = jobcard.find('div', {'class': 'summary'}).text.strip()

                job_page = jobcard.find('a').attrs['href']
                job_res = requests.get("http://www.indeed.com" + job_page)
                job_soup = BeautifulSoup(job_res.content, 'lxml')


                # find() returns None when a tag is missing, so .text raises AttributeError
                try:
                    posting['description'] = job_soup.find('div', {'class':'jobsearch-jobDescriptionText'}).text.strip()
                except AttributeError:
                    pass  # no description found; the field stays empty for this posting

                try:
                    posting['salary'] = jobcard.find('div', {'class': 'salarySnippet holisticSalary'}).text.strip()
                except AttributeError:
                    posting['salary'] = 'None'

                try:
                    posting['rating'] = jobcard.find('span', {'class': 'ratingsContent'}).text.strip()
                except AttributeError:
                    posting['rating'] = 'None'

                postings.append(posting)
                df = pd.DataFrame(postings)
                df.to_csv(f'df_{i+1}.csv', index=False)
        time.sleep(5)

    # return the collected postings so they can be used outside the function
    return pd.DataFrame(postings)


keywords = 'data database'
location = 'san francisco, ca'
maxpage = 100
df = indeed_scrape(keywords, location, maxpage)

res 1 code:  200
res 2 code:  200
res 3 code:  200
res 4 code:  200
res 5 code:  200
res 6 code:  200
res 7 code:  200
res 8 code:  200
res 9 code:  200
res 10 code:  200
res 11 code:  200
res 12 code:  200
res 13 code:  200
res 14 code:  200
res 15 code:  200
res 16 code:  200
res 17 code:  200
res 18 code:  200
res 19 code:  200
res 20 code:  200
res 21 code:  200
res 22 code:  200
res 23 code:  200
res 24 code:  200
res 25 code:  200
res 26 code:  200
res 27 code:  200
res 28 code:  200
res 29 code:  200
res 30 code:  200
res 31 code:  200
res 32 code:  200
res 33 code:  200
res 34 code:  200
res 35 code:  200
res 36 code:  200
res 37 code:  200
res 38 code:  200
res 39 code:  200
res 40 code:  200
res 41 code:  200
res 42 code:  200
res 43 code:  200
res 44 code:  200
res 45 code:  200
res 46 code:  200
res 47 code:  200
res 48 code:  200
res 49 code:  200
res 50 code:  200
res 51 code:  200
res 52 code:  200
res 53 code:  200
res 54 code:  200
res 55 code:  200
res 56 code:  200
r

### Check if the data frame looks as expected

In [298]:
df.head()

Unnamed: 0,title,company,location,summary,description,salary,rating
0,Pharmacy Technician (Walmart),Wal-Mart Pharmacy,"Richmond, CA 94806","Maintains pharmacy systems and technology (for example, inventory systems, automated medication counting machines) by preparing informational and related…","What you'll doMaintains pharmacy systems and technology (for example, inventory systems, automated medication counting machines) by preparing informational and related records and reports to ensure accurate data and system compliance with policies and regulations; monitoring inventory levels and expiration dates for pharmaceuticals; calibrating machines; applying settings to pill or capsule sp...",,3.5
1,"Application Architect (1.0 FTE, Days)",Lucile Packard Children's Hospital,"Menlo Park, CA (Lorelei Manor area)","Must have minimum of 3 years of experience with each of the following: Programming/Scripting, including Java, JavaScript, Angular JS, SQL, Shell, Perl, HTML;…","Information Technology\n1.0 FTE, 8 Hour Day Shift\n\nLucile Packard Children's Hospital Stanford is the heart and soul of Stanford Children’s Health. Nationally ranked and internationally recognized, our 311-bed hospital is devoted entirely to pediatrics and obstetrics. Our six centers of excellence provide comprehensive services and deep expertise in key obstetric and pediatric areas: brain &...",,4.0
2,Major Gifts Officer,Community Overcoming Relationship ABUSE,"San Mateo, CA 94403 (Bay Meadows area)",Manage the agency’s donor database (currently Bloomerang) and supervise administrative staff on completing daily tasks associated with donor data management.,"FULL TIME and EXEMPT\n\nReports to: Director of Development (IED FOR NOW)\n\nSchedule of Hours: M-F FT, weekends as needed\n\nPosition Remote Eligible: During the pandemic the role may be performed remote, however, THIS ROLE IS BASED OUT OF SAN MATEO, CA, AND THE SELECTED INDIVIDUAL MUST BE LOCAL TO THE AREA\n\nMandated Reporter: No\nABOUT THE ORGANIZATION\n\nCommunity Overcoming Relationship ...",,2.5
3,"Consulting Member of Technical Staff, OCI",Oracle,"Redwood City, CA",Expertise in relational database and NoSQL database.\nExperience with SQL and data management.\nNo Visa Sponsorship is available for this position.,"Consulting Member of Technical Staff, OCI-210007GT\n\nNo Visa Sponsorship is available for this position.\n\nApplicants are required to read, write, and speak the following languages: English\n\nPreferred Qualifications\nThe Oracle Cloud Infrastructure (OCI) team can provide you the opportunity to build and operate a suite of massive scale, integrated cloud services in a broadly distributed, m...",,3.8
4,Staff Software Engineer,Galileo Financial Technologies,"San Francisco, CA",Ability to both design schema and query data in a full-featured implementation of SQL.\nYou’ll be extending and maintaining the financial backend that powers our…,"Who we are\nWelcoming, collaborative and having the opportunity to make an impact - is how our employees describe working here. Galileo is a financial technology company that provides innovative and revolutionary software products and services that power some of the world's largest Fintechs. We are the only payments innovator that applies tech and engineering capabilities to empower Fintechs a...",,3.2


In [324]:
df.shape

(1147, 7)

- Although the scrape returned more than 1,100 rows, many are expected to be duplicates, so we drop the duplicated rows and then export the result as a CSV file.

In [326]:
df = df.drop_duplicates()

In [327]:
df.to_csv('job_df.csv', index=False)
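
`drop_duplicates()` with no arguments removes only rows that match in every column. If the same posting can appear with small differences in other fields, de-duplicating on a subset of columns may catch more repeats. The sketch below shows this on a toy frame; the choice of `title`, `company`, and `location` as the identifying columns is an assumption, and the rows are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only: the first two rows describe the same
# posting but differ in the scraped summary text.
toy = pd.DataFrame(
    {
        "title": ["Data Analyst", "Data Analyst", "Data Engineer"],
        "company": ["Acme", "Acme", "Acme"],
        "location": ["San Francisco, CA"] * 3,
        "summary": ["Analyze data.", "Analyze data…", "Build pipelines."],
    }
)

# Full-row comparison keeps all three rows because the summaries differ.
exact = toy.drop_duplicates()

# Comparing only the identifying columns keeps the first of each posting.
by_key = toy.drop_duplicates(subset=["title", "company", "location"]).reset_index(
    drop=True
)

print(len(exact), len(by_key))
```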