# Mini Project 1: Construct Job Postings Archive by Scraping LinkedIn

This notebook stores and demonstrates the code used to build a job postings archive by scraping LinkedIn.

Although one specific position, data scientist, is used here, the goal is a data pipeline that can handle any job position.

## Reference

- [A Complete Guide to Web Scraping LinkedIn Job Postings](https://maoviola.medium.com/a-complete-guide-to-web-scraping-linkedin-job-postings-ad290fcaa97f)
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
- Course slides and code

# 1 Scraping Data

In [18]:
import requests
import time
import random
import json
from bs4 import BeautifulSoup

In [13]:
position = 1
page = 0
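# Parenthesize the whole string so that .format() fills both placeholders;
# without the parentheses it applies only to the last literal ('&pageNum={}').
url = ('https://www.linkedin.com/jobs/search/'
       '?keywords=data%20scientist'
       '&location=United%20States'
       '&position={}'
       '&pageNum={}').format(position, page)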
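
Since the goal is a pipeline that works for any job position, the URL construction could be parameterized. The following is a minimal sketch, assuming the same four query parameters used in the cell above; `build_search_url` and `BASE_URL` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
from urllib.parse import urlencode, quote

# Assumed base URL, copied from the search URL used in this notebook.
BASE_URL = 'https://www.linkedin.com/jobs/search/'


def build_search_url(keywords, location, position=1, page=0):
    """Build a job search URL for arbitrary keywords and location (hypothetical helper)."""
    params = {
        'keywords': keywords,
        'location': location,
        'position': position,
        'pageNum': page,
    }
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
    return BASE_URL + '?' + urlencode(params, quote_via=quote)


url = build_search_url('data scientist', 'United States')
```

Letting `urlencode` handle the escaping means keywords such as `'machine learning engineer'` can be passed as plain text instead of being percent-encoded by hand.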

In [70]:
r = requests.get(url)

In [71]:
r

<Response [200]>

In [72]:
r.text

'<!DOCTYPE html>\n\n    \n    \n    \n\n    \n    <html lang="en">\n      <head>\n        <meta name="pageKey" content="d_jobs_guest_search">\n          <meta name="linkedin:pageTag" content="urlType=jserp_custom;emptyResult=false">\n        <meta name="locale" content="en_US">\n        <meta id="config" data-app-version="2.0.714" data-call-tree-id="HgJ9AzqgrhbAR375hSsAAA==" data-multiproduct-name="jobs-guest-frontend" data-service-name="jobs-guest-frontend" data-browser-id="13c56518-e97e-4f80-8198-f1f006a27121" data-enable-page-view-heartbeat-tracking>\n        <link rel="canonical" href="https://www.linkedin.com/jobs/data-scientist-jobs">\n<!----><!---->\n<!---->\n<!---->\n<!---->\n          <link rel="icon" href="https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca">\n\n          <script>\n            function getDfd() {let yFn,nFn;const p=new Promise(function(y, n){yFn=y;nFn=n;});p.resolve=yFn;p.reject=nFn;return p;}\n            window.lazyloader = getDfd();\n            w

In [74]:
soup = BeautifulSoup(r.text, 'html.parser')

In [75]:
soup.find('title')

<title>117,000+ Data Scientist jobs in United States (4,932 new)</title>

## Job Positions

In [76]:
# find all position names in this page
soup.find_all('span', {'class': 'screen-reader-text'})

[<span class="screen-reader-text">
             
         
         data scientist, People Analytics - Remote
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Data Scientist, Operations Data Science - Remote
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Data Scientist
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Data Scientist
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Data Scientist
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Entry Level Data Scientist: 2022
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Data Scientist
       
       
           </span>,
 <span class="screen-reader-text">
             
         
         Data Scientist- Stra

In [77]:
# Each page has 25 job positions posted
len(soup.find_all('span', {'class': 'screen-reader-text'}))

25

In [78]:
job_titles = soup.find_all('span', {'class': 'screen-reader-text'})
type(job_titles)

bs4.element.ResultSet

In [79]:
type(job_titles[0])

bs4.element.Tag

In [80]:
dir(job_titles[0])

['__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_find_all',
 '_find_one',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_should_pretty_print',
 'append',
 'attrs',
 'can_be_empty_element',
 'cdata_list_attributes',
 'childGenerator',
 'children',
 'clear',
 'contents',
 'decode',
 'decode_contents',
 'decompose',
 'decomposed',
 'descendants',
 'encode',
 'encode_contents',
 'extend',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 '

In [81]:
job_titles[0]

<span class="screen-reader-text">
            
        
        data scientist, People Analytics - Remote
      
      
          </span>

In [82]:
job_titles[0].string

'\n            \n        \n        data scientist, People Analytics - Remote\n      \n      \n          '

In [83]:
# Data cleaning: replace newlines with spaces, then collapse repeated whitespace
' '.join(job_titles[0].string.replace('\n', ' ').split())

'data scientist, People Analytics - Remote'
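
The cleaning step above could be wrapped in a small reusable function. This is a sketch only; `clean_text` is a hypothetical name. It relies on the fact that `str.split()` with no arguments already splits on any run of whitespace, including newlines, so a separate `replace('\n', ' ')` is not strictly needed.

```python
def clean_text(raw):
    """Collapse all runs of whitespace (spaces, tabs, newlines) into single spaces.

    Hypothetical helper; returns an empty string when raw is None, which can
    happen because Tag.string is None for tags with more than one child.
    """
    if raw is None:
        return ''
    return ' '.join(raw.split())


example = '\n            \n        \n        data scientist, People Analytics - Remote\n      \n      \n          '
cleaned = clean_text(example)
```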

In [84]:
# Clean every job title: split() handles newlines, so one pass collapses all whitespace
job_titles_lst = [' '.join(job_title.string.split()) for job_title in job_titles]

In [85]:
job_titles_lst

['data scientist, People Analytics - Remote',
 'Data Scientist, Operations Data Science - Remote',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Entry Level Data Scientist: 2022',
 'Data Scientist',
 'Data Scientist- Strategy & Analytics',
 'Data Scientist- Strategy & Analytics',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist - US Cards',
 'Data Scientist - Payments - Austin',
 'Entry Level Data Scientist: 2022',
 'Data Scientist',
 'Entry Level Data Scientist: 2022',
 'Data Scientist',
 'Data Scientist - Lab222',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist - Core Intelligence',
 'Data Scientist',
 'Data Scientist']
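
For a pipeline that processes many result pages, the title-extraction steps of this section could be gathered into one function. The sketch below assumes that job titles sit in `span` elements with the class `screen-reader-text`, as observed on the page scraped above; that markup may change. `extract_job_titles` and `sample_html` are hypothetical names, and the sample HTML is a made-up fragment for illustration, not real LinkedIn output.

```python
from bs4 import BeautifulSoup


def extract_job_titles(html):
    """Return the cleaned job titles found in one search results page (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: each job title is inside <span class="screen-reader-text">.
    spans = soup.find_all('span', {'class': 'screen-reader-text'})
    # get_text() works even if a span contains nested tags, unlike .string.
    return [' '.join(span.get_text().split()) for span in spans]


# Made-up HTML fragment that mimics the structure seen in this notebook.
sample_html = '''
<ul>
  <li><span class="screen-reader-text">
      Data Scientist
  </span></li>
  <li><span class="screen-reader-text">
      Entry Level Data Scientist: 2022
  </span></li>
</ul>
'''

titles = extract_job_titles(sample_html)
```

With a live response, the same function would be called as `extract_job_titles(r.text)`.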

## Job Responsibility

In [86]:
# find all job description sections on this page
soup.find_all('section', {'class': 'show_more_less_html show_more_less_html--more'})

[]