## 12 Web scraping
"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."[[1]](https://beautiful-soup-4.readthedocs.io/en/latest/)<br><br>
The easiest way to install Beautiful Soup is via pip: <i>pip install beautifulsoup4</i><br>
The requests package [[2]](https://pypi.org/project/requests/), a simple library for sending HTTP requests, can also be installed via pip: <i>pip install requests</i><br><br>
This notebook contains some useful examples of how to use bs4. The [documentation](https://beautiful-soup-4.readthedocs.io/en/latest/) covers installation, features, and everything else you might need when working with this package.

### Basics

In [None]:
import requests
from bs4 import BeautifulSoup

Use the requests package to get the contents of the given URL:

In [None]:
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

print(page.text)
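A request can fail or hang, so it can help to wrap the download in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_html` and the 10-second default timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising an exception on HTTP errors."""
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("https://realpython.github.io/fake-jobs/")
```

Failing early on a bad status code avoids parsing an error page as if it were the expected content.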

Beautiful Soup can work with several parsers, for example Python's built-in html.parser, or lxml and html5lib if they are installed. We use html.parser here because we are dealing with an HTML page and it needs no extra installation.

In [None]:
soup = BeautifulSoup(page.content, "html.parser")
print(f"type(soup): {type(soup)}\n")

In [None]:
print(soup)

find() returns the first element that matches your search criteria, including everything nested inside it, or None if nothing matches. Here we look up the element whose id is ResultsContainer:

In [None]:
results = soup.find(id="ResultsContainer")
print(results)
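To see what find() returns without downloading anything, here is a small sketch on an inline HTML string. The markup is made up for illustration and only loosely imitates the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page content.
html = """
<div id="before">ignored</div>
<div id="ResultsContainer">
  <p class="location">Berlin</p>
  <p class="location">Vienna</p>
</div>
<div id="after">also ignored</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The matching element and everything nested inside it.
container = soup.find(id="ResultsContainer")
# find() on a tag searches only that tag's descendants and stops at the first hit.
first_p = container.find("p")
# No match yields None instead of raising an error.
missing = soup.find(id="does-not-exist")

print(container["id"])
print(first_p.text)
print(missing)
```

Because a failed search yields None, it is worth checking the result before accessing attributes such as .text on it.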

"The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:"[[3]](https://beautiful-soup-4.readthedocs.io/en/latest/#making-the-soup)

In [None]:
print(results.prettify())

Accessing a tag name as an attribute returns the first element with that HTML tag:

In [None]:
results.div

Return all elements with a specific HTML-tag:

In [None]:
print(len(list(results.find_all('p'))))

Return all elements with a specific HTML-tag and class:

In [None]:
print(len(list(results.find_all('p', class_="location"))))

In [None]:
# Run the search once and reuse the result; it can be indexed directly.
paragraphs = results.find_all('p')
print(paragraphs[0])
print()
print(paragraphs[1])
print()
print(paragraphs[2].text)
print()
print(paragraphs[3].text)
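The result of find_all() behaves like a list, so it supports len(), indexing, and iteration without a list() conversion. A minimal sketch on made-up markup (the class names and values are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page content.
html = """
<div id="ResultsContainer">
  <p class="location">Berlin</p>
  <p class="is-small">2021-04-08</p>
  <p class="location">Vienna</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_paragraphs = soup.find_all("p")
# class_ (with trailing underscore) filters on the HTML class attribute.
locations = soup.find_all("p", class_="location")

print(len(all_paragraphs), len(locations))
print(locations[0].text)
# Collect the text of every match in one pass.
location_names = [p.text for p in locations]
print(location_names)
```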

### Find all job offers

As the page we are looking at is all about fake job offers, let's see what it has to offer:

In [None]:
job_elements = results.find_all("div", class_="card-content")
print(job_elements)

To get a better and clearer overview we can print the job_elements in the following way:

In [None]:
for job_element in job_elements:
    print(job_element, end="\n"*2)
    print()
    print()

As seen above, there are multiple job_elements, each representing one open job offer. We now want to extract only the information that is relevant to us, in this case the job title, company name, and job location. To do so, we iterate over all jobs found and search each one for the specific HTML elements we are looking for:

In [None]:
for job_element in job_elements:
    print(type(job_element))
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
    print()


#### Extract text
We did find the elements we were looking for, yet they still contain unnecessary information like the HTML syntax. By using ".text" for each retrieved element we can get rid of the HTML syntax:

In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text)
    print(company_element.text)
    print(location_element.text)
    print()


This is still not quite what we want, because the returned text often contains leading or trailing whitespace, as seen above. To strip it away, call .strip() on the text:

In [None]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    # print(len(title_element.text))
    # print(len(title_element.text.strip()))
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
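As an alternative to .text.strip(), the get_text() method accepts strip=True and an optional separator. The sketch below runs on made-up markup that imitates the whitespace seen above. It is one possible way to clean the text, not the notebook's own approach.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the kind of stray whitespace seen on real pages.
html = """<div class="card-content"><h2 class="title">
  Senior Python Developer
</h2>
<p class="location">
  Stewartbury, AA
</p></div>"""
soup = BeautifulSoup(html, "html.parser")

title = soup.find("h2", class_="title")
# For an element holding a single string, both spellings give the same result.
stripped = title.text.strip()
via_get_text = title.get_text(strip=True)
print(stripped == via_get_text)

# On a container, strip=True strips every string, drops the empty ones,
# and joins the rest with the given separator.
card = soup.find("div", class_="card-content")
summary = card.get_text(" | ", strip=True)
print(summary)
```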


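To look for a specific job offer, pass the string argument to find_all(). It returns only elements whose text matches the given string exactly, so searching for "Python" alone finds nothing unless a title consists of exactly that word. The second search uses a complete job title instead: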

In [None]:
python_jobs = results.find_all("h2", string="Python")
print(f"python_jobs: {python_jobs}")

python_jobs = results.find_all("h2", string="Software Engineer (Python)")
print(f"python_jobs: {python_jobs}")

For more flexible matching, pass a function instead of a fixed string. The lambda below keeps every h2 element whose text contains "python", regardless of case:

In [None]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)
print(f"python_jobs: {python_jobs}")
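One caveat: a heading that contains several child elements has no single string, and the filter function may then be called with None, in which case text.lower() raises an AttributeError. A defensive variant is sketched below on made-up markup. The None check is an addition for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second heading has two child tags and
# therefore no single string of its own.
html = """
<h2>Senior Python Developer</h2>
<h2><span>Python</span> <b>Lead</b></h2>
<h2>Energy engineer</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact matching: no heading consists of exactly the word "Python".
exact = soup.find_all("h2", string="Python")

# Guard against None before calling .lower().
matches = soup.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)
titles = [h2.get_text(strip=True) for h2 in matches]
print(exact)
print(titles)
```

Headings with nested markup are skipped by this filter. If they matter, iterating over all h2 elements and testing get_text() is an alternative.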

### Expand found elements
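The .parent attribute returns the element that directly encloses a given element. Chaining .parent several times walks further up the tree, which we use below to get from each matching h2 back to the element that holds the whole job offer: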

In [None]:
soup.title.parent

In [None]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]


In [None]:
for el in python_job_elements:
    print(el)
    print()
    print()
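Counting .parent hops breaks as soon as the nesting depth changes. The find_parent() method searches upward by tag name and class instead. The sketch below uses made-up markup, and the nesting and class names are assumptions modelled loosely on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified job card.
html = """
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-content">
        <h2 class="title">Senior Python Developer</h2>
      </div>
    </div>
    <footer class="card-footer">
      <a href="https://example.com/apply">Apply</a>
    </footer>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# Three hops up: media-content -> media -> card-content.
by_hops = h2.parent.parent.parent
# The same element, found by what it is rather than by how deep it sits.
by_name = h2.find_parent("div", class_="card-content")

print(by_hops is by_name)
print(by_name.find("a")["href"])
```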

#### Extracting URLs
Sometimes you want to extract not just the text of the elements you found, but also the URLs they contain. Using .text does not help here, because it only returns the visible label of each link:

In [None]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        print(link.text.strip())

By using ["href"] we can directly access the hyperlink information contained inside the HTML element:

In [None]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"Apply here: {link_url}\n")
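Indexing with ["href"] raises a KeyError when a link has no href attribute, and an href may be relative. A more forgiving variant, sketched on made-up markup, uses .get("href") together with urljoin from the standard library. The link labels and the relative path are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://realpython.github.io/fake-jobs/"

# Hypothetical markup: one absolute link, one relative link, one without href.
html = """
<footer>
  <a href="https://www.realpython.com">Learn</a>
  <a href="jobs/senior-python-developer-0.html">Apply</a>
  <a>No target</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
for link in soup.find_all("a"):
    href = link.get("href")  # None instead of KeyError when href is missing
    if href is None:
        continue
    # Absolute URLs pass through unchanged; relative ones are resolved.
    urls.append(urljoin(BASE_URL, href))

for url in urls:
    print(url)
```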

In [None]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        # Skip the link to the Real Python homepage; it is not a job posting.
        if link_url == "https://www.realpython.com":
            continue
        print(f"Apply here: {link_url}\n")
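Instead of excluding a known URL, the wanted link can also be selected by its label. The sketch below assumes the application link is labelled exactly "Apply". The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical card footer with two links.
html = """
<footer class="card-footer">
  <a href="https://www.realpython.com">Learn</a>
  <a href="https://example.com/jobs/senior-python-developer-0.html">Apply</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

# string= matches the link text exactly, so "Learn" is not selected.
apply_link = soup.find("a", string="Apply")
apply_url = apply_link["href"] if apply_link is not None else None
print(f"Apply here: {apply_url}")
```

Selecting by label keeps working if further unrelated links are added to the footer, but it breaks if the label changes. Which approach is more robust depends on the page.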