<a href="https://colab.research.google.com/github/claracavalcante/pycon-sweden-2020/blob/main/PyCon_Sweden_2020_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyCon Sweden 2020 - How to make opportunities fall into your lap?

## **Beautiful Soup**

Beautiful Soup is a Python library for extracting data from HTML and XML files. It sits on top of a parser (such as Python's built-in `html.parser`) and provides intuitive ways to browse, search, and modify the parsed document.

In the first steps below, we import the BeautifulSoup class and demonstrate some basic operations on an example HTML document.

In [110]:
from bs4 import BeautifulSoup

In [111]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [112]:
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
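
The three kinds of operation mentioned in the introduction (browsing, searching, modifying) can be sketched on a smaller document. This is only an illustration and not part of the original talk; the markup and the names `doc`, `bold`, and `found` are made up for the example.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only for this illustration.
doc = BeautifulSoup('<p class="greeting">Hello, <b>world</b></p>', 'html.parser')

# Browse: attribute access walks down to the first matching descendant.
bold = doc.p.b

# Search: find() returns the first tag that matches the name and class.
found = doc.find('p', class_='greeting')

# Modify: assigning to .string replaces the tag's text in the tree.
bold.string = 'PyCon'

print(found.get_text())  # Hello, PyCon
print(doc)               # <p class="greeting">Hello, <b>PyCon</b></p>
```

Because `found` and `bold` point into the same tree, the change made through `bold` is visible through `found` as well.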


### Important methods

In [113]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [114]:
soup.find('a')

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [115]:
soup.title

<title>The Dormouse's story</title>

In [116]:
soup.title.get_text()

"The Dormouse's story"

In [117]:
soup.select('body a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [118]:
soup.select('p > #link1')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [119]:
soup.select('body > .title')

[<p class="title"><b>The Dormouse's story</b></p>]
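
The methods above return tags; in scraping we usually want the text and attributes inside them. The sketch below uses a shortened copy of the example document (the names `snippet`, `demo`, `links`, and `missing` are chosen here for illustration) and combines `select` with attribute access.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example document above.
snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
demo = BeautifulSoup(snippet, 'html.parser')

# Tags behave like dictionaries for their attributes: tag['href'] raises
# KeyError when the attribute is missing, while tag.get('href') returns None.
links = [(a.get_text(), a['href']) for a in demo.select('p.story > a.sister')]
missing = demo.find('a').get('title')

print(links)
print(missing)
```

Using `.get()` is the safer choice when an attribute may be absent on some tags.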

## **Remoteok.io**

[Link to Remoteok.io](https://remoteok.io/remote-dev-jobs)

Several libraries can be used to collect web pages. In this presentation we will use *urllib*.

In [120]:
from urllib.request import Request, urlopen

In [121]:
url = 'https://remoteok.io/remote-dev-jobs'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()
web_page = web_byte.decode('utf-8', errors='replace')  # replace undecodable bytes instead of failing

web_page
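
The download step can be wrapped in a small helper. The sketch below is one possible shape, not the code used in the talk: `build_request` and `fetch_html` are hypothetical names, the 10-second timeout is an arbitrary choice, and the charset is taken from the response headers with UTF-8 as an assumed fallback. The custom `User-Agent` is kept because some sites appear to reject urllib's default one.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent='Mozilla/5.0'):
    """Wrap a URL in a Request that carries a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not name a charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# Example call (needs network access):
# web_page = fetch_html('https://remoteok.io/remote-dev-jobs')
```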



We will use BeautifulSoup as a parser for the collected content. The BeautifulSoup object represents the parsed document as a whole.

In [122]:
soup = BeautifulSoup(web_page, 'html.parser')

section = soup.find("table")

print(section)
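
Note that `find` returns `None` when nothing matches, which is what happens if a site changes its layout. A minimal guard, using made-up page fragments and a hypothetical helper named `first_table`, could look like this:

```python
from bs4 import BeautifulSoup

# Made-up page fragments: one with a table, one without.
with_table = BeautifulSoup('<table><tr><td>job</td></tr></table>', 'html.parser')
without_table = BeautifulSoup('<p>No listings today</p>', 'html.parser')


def first_table(parsed):
    """Return the first <table> tag, or raise a clear error if none exists."""
    table = parsed.find('table')
    if table is None:
        raise ValueError('no <table> found; the page layout may have changed')
    return table


print(first_table(with_table).td.get_text())  # job
print(without_table.find('table'))            # None
```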



In [123]:
jobs = []

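# Each job posting is a table row with the CSS class "job".
for tr in section.find_all('tr', 'job'):
  source = tr.find('td', 'source')
  # Skip rows that have no "source" cell or no link inside it.
  if source is None or source.a is None:
    continue

  job = {}
  job['company'] = tr.h3.get_text()
  job['role'] = tr.h2.get_text()

  # Location extraction is left disabled, as in the original notebook.
  #if tr.find('td', 'company_and_position').find('div', 'location'):
    #job['location'] = tr.div.get_text().encode('ascii', 'replace')

  # Collect the technology tags attached to the posting.
  job['tags'] = []
  get_tags = tr.find('td', 'tags')
  for tag in get_tags.find_all('div', 'tag'):
    job['tags'].append(tag.h3.get_text())

  job['link'] = 'https://remoteok.io' + source.a['href']
  jobs.append(job)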

jobs

[{'company': 'Clevertech',
  'link': 'https://remoteok.io/l/100227',
  'role': 'Angular Developer',
  'tags': ['angular', 'javascript', 'java', 'angularjs']},
 {'company': 'Octopods',
  'link': 'https://remoteok.io/cdn-cgi/l/email-protection#a9ddc8dbccc28c9d99c6caddc6d9c6cdda87c0c696dadccbc3cccadd94e7ccde8c9b99c8d9d9c5c0cac8c7dd8c9b99cfdbc6c48c9b99fbccc4c6ddcc8c9b99e6e28c9b99cfc6db8c9b99faccc7c0c6db8c9b99efdbc6c7ddccc7cd8c9b99ecc7cec0c7ccccdb8c9b99c8dd8c9b99e6caddc6d9c6cdda8fcbc6cdd094e1c08c9b988c99e88c99e8e08c9b9ecd8c9b99c5c0c2cc8c9b99ddc68c9b99c8d9d9c5d08c9b99cfc6db8c9b99ddc1cc8c9b99d9c6dac0ddc0c6c78c9b99c6cf8c9b99faccc7c0c6db8c9b99efdbc6c7ddccc7cd8c9b99ecc7cec0c7ccccdb8c9b99c8dd8c9b99e6caddc6d9c6cdda8c9b99d9c6daddcccd8c9b99c6c78c9b99fbccc4c6ddcc8c9b99e6e287',
  'role': 'Senior Frontend Engineer',
  'tags': ['dev', 'javascript', 'react', 'vuejs']},
 {'company': 'IGBlade',
  'link': 'https://remoteok.io/l/100211',
  'role': 'Experienced Frontend Developer',
  'tags': ['vue', 'bootstra
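
To make the loop above easier to follow without hitting the live site, here is the same idea applied to a small, hypothetical table. The markup is only inferred from the class names the loop searches for (`job`, `source`, `tags`, `tag`); the real page is certainly more complex and may have changed since. The function name `parse_jobs` is made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the structure the loop above expects.
sample = """
<table>
  <tr class="job">
    <td class="company_and_position">
      <h2>Backend Developer</h2>
      <h3>Example Corp</h3>
    </td>
    <td class="tags">
      <div class="tag"><h3>python</h3></div>
      <div class="tag"><h3>django</h3></div>
    </td>
    <td class="source"><a href="/l/1">Apply</a></td>
  </tr>
  <tr class="job"><td class="source"></td></tr>
</table>
"""


def parse_jobs(table):
    """Turn every linked <tr class="job"> row into a dictionary."""
    parsed = []
    for tr in table.find_all('tr', 'job'):
        source = tr.find('td', 'source')
        if source is None or source.a is None:
            continue  # rows without a link are skipped
        tags_cell = tr.find('td', 'tags')
        parsed.append({
            'company': tr.h3.get_text(),
            'role': tr.h2.get_text(),
            'tags': [div.h3.get_text() for div in tags_cell.find_all('div', 'tag')],
            'link': 'https://remoteok.io' + source.a['href'],
        })
    return parsed


demo_jobs = parse_jobs(BeautifulSoup(sample, 'html.parser').find('table'))
print(demo_jobs)
```

The second row has no link, so only one job is returned.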

In [124]:
import pandas as pd

jobs_pd = pd.DataFrame(jobs)

jobs_pd.head()

| | company | role | tags | link |
|---|---|---|---|---|
| 0 | Clevertech | Angular Developer | [angular, javascript, java, angularjs] | https://remoteok.io/l/100227 |
| 1 | Octopods | Senior Frontend Engineer | [dev, javascript, react, vuejs] | https://remoteok.io/cdn-cgi/l/email-protection... |
| 2 | IGBlade | Experienced Frontend Developer | [vue, bootstrap, frontend, graphql] | https://remoteok.io/l/100211 |
| 3 | Semaphore | Senior Software Engineer | [software engineering, paas, elixir, go] | https://remoteok.io/l/100209 |
| 4 | Forestreet | Front End Developer | [react, javascript, typescript, rest apis] | https://remoteok.io/cdn-cgi/l/email-protection... |
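
As a complement to the table view, a quick summary of which tags appear most often can be computed directly from the list of dictionaries. The sketch below uses only the standard library and three of the rows shown above, copied into a separate list named `sample_jobs` so that the scraped `jobs` list is left untouched.

```python
from collections import Counter

# Three of the rows shown above, reduced to the fields needed here.
sample_jobs = [
    {'company': 'Clevertech', 'tags': ['angular', 'javascript', 'java', 'angularjs']},
    {'company': 'Octopods', 'tags': ['dev', 'javascript', 'react', 'vuejs']},
    {'company': 'Forestreet', 'tags': ['react', 'javascript', 'typescript', 'rest apis']},
]

# Flatten every tag list into one stream and count the occurrences.
tag_counts = Counter(tag for job in sample_jobs for tag in job['tags'])
print(tag_counts.most_common(2))  # [('javascript', 3), ('react', 2)]
```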


### Filtering?

In [125]:
ideal_technology = 'python'

In [126]:
ideal_level = 'Senior'

In [127]:
import copy
filtered_jobs = []

for job in jobs:
  if ideal_technology in job['tags'] and ideal_level in job['role']:
    filtered_jobs.append(copy.deepcopy(job))

filtered_jobs

[{'company': 'Clarity Movement Co.',
  'link': 'https://remoteok.io/l/99896',
  'role': 'Senior Backend Developer',
  'tags': ['python',
   'mongodb',
   'aws ec2 lambda api gatewayecs',
   'amplify django']},
 {'company': 'Very LLC',
  'link': 'https://remoteok.io/l/99730',
  'role': 'Senior Software Engineer React React',
  'tags': ['react', 'react native', 'javascript', 'python']},
 {'company': 'Netdata Inc',
  'link': 'https://remoteok.io/l/99258',
  'role': 'Senior Software Engineer',
  'tags': ['linux systems', 'problem solving', 'python', 'ruby']}]
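
The filter above is case-sensitive and only matches a tag exactly, so a tag such as `'amplify django'` would not match `'django'`. One possible variation, shown here as a sketch with a hypothetical helper named `matches`, compares in lower case and accepts partial tag matches. The sample data is copied from the outputs above.

```python
def matches(job, technology, level):
    """Case-insensitive check of a job's tags and role."""
    technology = technology.lower()
    has_technology = any(technology in tag.lower() for tag in job['tags'])
    has_level = level.lower() in job['role'].lower()
    return has_technology and has_level


sample_jobs = [
    {'company': 'Clarity Movement Co.', 'role': 'Senior Backend Developer',
     'tags': ['python', 'mongodb', 'aws ec2 lambda api gatewayecs', 'amplify django']},
    {'company': 'Clevertech', 'role': 'Angular Developer',
     'tags': ['angular', 'javascript', 'java', 'angularjs']},
]

django_jobs = [job['company'] for job in sample_jobs if matches(job, 'Django', 'senior')]
print(django_jobs)  # ['Clarity Movement Co.']
```

The trade-off is that partial matching is looser: searching for `'java'` this way would also match `'javascript'`, so exact matching remains the better choice for short technology names.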

## **We Work Remotely**

[Link to We Work Remotely](https://weworkremotely.com/)

In [128]:
from urllib.request import Request, urlopen

url = 'https://weworkremotely.com/categories/remote-programming-jobs#job-listings'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()
web_page = web_byte.decode('utf-8')


soup = BeautifulSoup(web_page, 'html.parser')

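jobs = []

# Search the newly parsed page (soup), not the `section` table left over
# from the Remoteok.io example above.
for li in soup.find_all('li', 'feature'):
  job = {}
  company_spans = li.find_all('span', 'company')
  job['company'] = company_spans[0].get_text()
  job['title'] = li.find('span', 'title').get_text()
  if li.find('span', 'region'):
    job['region'] = li.find('span', 'region').get_text()
  # The second "company" span is stored as the period, when it exists.
  if len(company_spans) > 1:
    job['period'] = company_spans[1].get_text()
  jobs.append(job)

jobs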
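
The extraction logic for this site can be tried offline on a small, hypothetical list. The markup below is only inferred from the class names used in the loop above (`feature`, `company`, `title`, `region`) and may not match the live page; `text_or_none` and `listings` are names made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inferred from the class names used in the loop above.
sample = """
<ul>
  <li class="feature">
    <span class="company">Example Corp</span>
    <span class="title">Python Developer</span>
    <span class="company">Full-Time</span>
    <span class="region">Anywhere</span>
  </li>
  <li class="view-all">View all jobs</li>
</ul>
"""
demo = BeautifulSoup(sample, 'html.parser')


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


listings = []
for li in demo.find_all('li', 'feature'):
    companies = li.find_all('span', 'company')
    listings.append({
        'company': text_or_none(companies[0]) if companies else None,
        'title': text_or_none(li.find('span', 'title')),
        'region': text_or_none(li.find('span', 'region')),
        'period': text_or_none(companies[1]) if len(companies) > 1 else None,
    })

print(listings)
```

Filling missing fields with `None` keeps every dictionary the same shape, which gives a DataFrame with consistent columns.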

In [129]:
import pandas as pd

jobs_pd = pd.DataFrame(jobs)

jobs_pd.head()