# Web Scraping Project

Hi, everyone! In this notebook, I will scrape some data from Fake Python job postings, an example website provided by [Real Python](https://realpython.com) for practicing. As this is pretty much an exercise on what has been [posted](https://realpython.com/beautiful-soup-web-scraper-python/) on their website, I would like to thank Martin Breuss for putting together the tutorial!

Today, I will build a web scraper that fetches Python **software developer** job listings from the [Fake Python Jobs](https://realpython.github.io/fake-jobs/) site.

Import libraries and retrieve the HTML:

In [None]:
import requests
from bs4 import BeautifulSoup
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

print(page.text)
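Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch and not part of the original tutorial: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the HTML of `url`, raising an exception on HTTP errors.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("https://realpython.github.io/fake-jobs/")
```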



Create a Beautiful Soup object:

In [None]:
# page.content (the raw HTML) is the input;
# 'html.parser' selects Python's built-in HTML parser
soup = BeautifulSoup(page.content, 'html.parser')

## Find elements by ID

Let us look for the element "ResultsContainer", which contains all the job postings:

In [None]:
results = soup.find(id='ResultsContainer')
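As a small illustration of how a lookup by ID behaves, the sketch below runs `.find(id=...)` on a tiny inline HTML string. The markup is a made-up stand-in for the real page, and `NoSuchContainer` is a hypothetical ID used only to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page; the markup is made up for illustration.
html = """
<html><body>
  <div id="ResultsContainer">
    <div class="card-content"><h2 class="title">Energy engineer</h2></div>
  </div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchContainer")  # hypothetical ID, not on the page

print(found.name)  # tag name of the matched element
print(missing)     # None, because no element has that id
```

Since a failed lookup returns `None`, calling a method on the result of a mistyped ID would raise an `AttributeError`, so a quick `is None` check can save some debugging.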

Let us also use `prettify()` to provide an easier viewing of the Beautiful Soup object:

In [None]:
print(results.prettify())

## Find Elements by HTML Class Name

All job postings are wrapped in a `<div>` element with the class `card-content`. Let us now use `results` and select only the job postings in it:

In [None]:
job_elements = results.find_all('div', class_='card-content')
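As a quick aside on what `.find_all()` hands back, here is a small sketch on made-up inline HTML (the markup and the variable name `cards` are assumptions for illustration). The result can be measured and indexed like a list, and a search with no matches comes back empty rather than raising an error.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure of the results container.
html = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Senior Python Developer</h2></div>
  <div class="card-content"><h2 class="title">Energy engineer</h2></div>
  <div class="other">Not a job card</div>
</div>
"""
results = BeautifulSoup(html, "html.parser").find(id="ResultsContainer")
cards = results.find_all("div", class_="card-content")

print(len(cards))                      # number of matching <div> elements
print(cards[0].h2.text)                # index like a list, then step into the tag
print(len(results.find_all("table")))  # no match gives an empty result
```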

`.find_all()` returns an iterable containing the HTML of every available job listing on the page. Let us take a look at them:

In [None]:
for job_element in job_elements:
    print(job_element, end='\n'*2)

Still, too much HTML. Let us pick out the elements we want from each job posting with `.find()`:


In [None]:
for job_element in job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    print(title_element)
    print(company_element)
    print(location_element)
    print()

Each `job_element` is itself a Beautiful Soup element (a `Tag` object), which means that the same methods we used on the parent element, `results`, can be used on it.
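To see why the same search methods are available at every level, the sketch below checks the types involved on a small made-up HTML string. The markup is an assumption for illustration; `Tag` is the element class from `bs4.element`, and `BeautifulSoup` itself is a subclass of it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup standing in for one job card inside the container.
html = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title">Energy engineer</h2>
    <h3 class="company">Vasquez-Davidson</h3>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="ResultsContainer")
job_element = results.find("div", class_="card-content")

# Both the container and the card are Tag objects.
print(isinstance(results, Tag), isinstance(job_element, Tag))
# The top-level soup object shares the same interface.
print(issubclass(BeautifulSoup, Tag))
# So .find() works on the card just as it did on the container.
print(job_element.find("h3", class_="company").text)
```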

## Extract Text From HTML Elements


You can add `.text` to a Beautiful Soup object to return **only** the text content of the HTML elements that the object contains:

In [None]:
for job_element in job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    print(title_element.text)
    print(company_element.text)
    print(location_element.text)
    print()

Let us get rid of the extra **whitespace** with the `.strip()` method:

In [None]:
for job_element in job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
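The difference between the raw and the stripped text is easier to see with `repr()`. This is a small sketch on a made-up `<h2>` element; it also shows `get_text(strip=True)`, which is an alternative to chaining `.text.strip()` and is not used in the cells above.

```python
from bs4 import BeautifulSoup

# A made-up title element with the kind of padding found in real markup.
html = """<h2 class="title">
    Senior Python Developer
</h2>"""
title = BeautifulSoup(html, "html.parser").find("h2")

raw = title.text                          # keeps newlines and indentation
clean = title.text.strip()                # plain str.strip() on the result
also_clean = title.get_text(strip=True)   # stripping done by Beautiful Soup

print(repr(raw))
print(repr(clean))
print(repr(also_clean))
```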

## Find Elements by Class Name and Text Content

We are still looking for software developer positions, so we need to filter the results. Since titles are kept within `<h2>` elements, we can use the `string` argument to get specific outputs:

In [None]:
python_jobs = results.find_all('h2', string='Python') #finds all <h2> elements where the contained string matches "Python" exactly

In [None]:
print(python_jobs)

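Hey, why is the result empty? That is because `string=` looks for an exact match: it is sensitive to spelling, whitespace, and capitalization, so any difference from the full text of the `<h2>` element means no match. No job title consists of the word "Python" alone.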
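Here is a small sketch of what exact matching means in practice, using made-up titles (the third one exists only to produce a hit). It also shows a compiled regular expression as one possible way to match a substring; that is an alternative not used elsewhere in this notebook.

```python
import re

from bs4 import BeautifulSoup

# Made-up titles; only the last one is exactly the string "Python".
html = """
<h2 class="title">Senior Python Developer</h2>
<h2 class="title">Energy engineer</h2>
<h2 class="title">Python</h2>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("h2", string="Python")                 # whole text must match
partial = soup.find_all("h2", string=re.compile("Python"))   # regex search in text
wrong_case = soup.find_all("h2", string="python")            # case matters

print(len(exact), len(partial), len(wrong_case))
```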

## Pass a Function to a Beautiful Soup Method


You can sometimes pass **functions** as arguments to Beautiful Soup methods:

In [None]:
python_jobs = results.find_all(
    'h2', string=lambda text: 'python' in text.lower()
)

Here, an **anonymous function** is passed to the `string=` argument. The **lambda function** looks at the text of each `<h2>` element, converts it to lowercase, and checks whether the substring "python" is found anywhere. Let us check it out:

In [None]:
print(len(python_jobs))

Great! We have found 10 jobs that include the word "python" in their job title. 
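The lambda works on this site because every title is plain text. As a hedged sketch: when an `<h2>` contains nested tags, its `.string` is `None`, and depending on the Beautiful Soup version the filter function may be handed that `None`, in which case `text.lower()` would fail. A named function with a guard avoids this; `mentions_python` and the markup below are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up titles; the third one has nested tags, so its .string is None.
html = """
<h2 class="title">Senior Python Developer</h2>
<h2 class="title">Energy engineer</h2>
<h2 class="title"><span>Python</span> <em>Lead</em></h2>
"""
soup = BeautifulSoup(html, "html.parser")


def mentions_python(text):
    """Return True when `text` is a string containing 'python' in any casing."""
    return text is not None and "python" in text.lower()


matches = soup.find_all("h2", string=mentions_python)
print([h2.text for h2 in matches])
```

Note that the nested title is skipped rather than matched, so this guard trades a crash for a possible miss; whether that is acceptable depends on the page being scraped.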

## Access Parent Elements

Stepping up in the hierarchy of the DOM, starting from the identified `<h2>` elements, helps here. For each `<h2>` element that contains a **job title**, we want its **closest parent element** that contains all the information we are interested in. In this case, that is the `<div>` element with the `card-content` class. We can now use `python_jobs` to fetch these great-grandparent elements:

In [None]:
python_jobs = results.find_all(
    'h2', string=lambda text: 'python' in text.lower()
)

python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]

This is a **list comprehension** that works on each of the `<h2>` title elements in `python_jobs` that were retrieved by filtering with the lambda expression. Let us now adapt the `for` loop to iterate over these parent elements:

In [None]:
for job_element in python_job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
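Counting `.parent` hops ties the code to the exact nesting depth of the page. As an alternative sketch, `find_parent()` searches upward for the nearest ancestor that matches, so it keeps working if a wrapper element is added or removed. The markup below is a made-up, simplified job card, and the `example.com` URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one job card; the link URL is a placeholder.
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title">Senior Python Developer</h2>
    </div>
  </div>
  <footer class="card-footer"><a href="https://example.com/apply">Apply</a></footer>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2", class_="title")

by_hops = h2_element.parent.parent.parent                         # fixed depth
by_search = h2_element.find_parent("div", class_="card-content")  # by class

print(by_hops is by_search)         # both reach the same element here
print(by_search.find("a")["href"])  # the card gives access to its links
```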

## Extract Attributes From HTML Elements


Now, we need to scrape the link to apply for a job. However, if we try to find link elements in the same way as we have been doing so far, we will not retrieve the URLs for which we are looking:

In [None]:
for job_element in python_job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    links = job_element.find_all('a')
    for link in links:
        print(link.text.strip())

See? The results display only the link texts **Learn** and **Apply** instead of the associated URLs. Why? Because the `.text` attribute returns only the **visible** content of an HTML element: all attributes, including those containing URLs, are stripped away. We need to extract the value of one of the HTML attributes instead.
We can see which one by looking for the `href` attribute in the HTML of a single job posting. Example:

In [None]:
%%html
<!-- snip -->
    <footer class="card-footer">
        <a href="https://www.realpython.com" target="_blank"
           class="card-footer-item">Learn</a>
        <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
           target="_blank"
           class="card-footer-item">Apply</a>
    </footer>
  </div>
</div>



Now, we can extract the value of these `href` attributes in each `<a>` element.

In [None]:
for job_element in python_job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    links = job_element.find_all('a')
    for link in links:
        link_url = link['href']
        print(f"Apply here: {link_url}\n")
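Square-bracket access assumes the attribute exists. As a small sketch on made-up markup (the second link deliberately has no `href`), the example below compares dictionary-style access with `.get()`, which returns `None` for a missing attribute, and shows `.attrs`, which holds all attributes of a tag.

```python
from bs4 import BeautifulSoup

# Made-up footer; the second <a> lacks an href on purpose.
html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" target="_blank">Learn</a>
  <a name="no-href-here">Broken</a>
</footer>
"""
links = BeautifulSoup(html, "html.parser").find_all("a")

print(links[0].text)         # visible link text only
print(links[0]["href"])      # attribute value via dictionary-style access
print(links[0].attrs)        # all attributes of the tag
print(links[1].get("href"))  # None instead of a KeyError
```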

Nice, we fetched all the links from the job postings we had previously filtered. We then extracted the `href` attribute containing the URLs. Still, not all the links are useful to our query. Let us refine our results a little bit further.

Exercise - how do we retrieve only the URL of the *second* link?

In [None]:
for job_element in python_job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    link_url = job_element.find_all('a')[1]['href']
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print(f"Apply here: {link_url}\n")
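Picking the link by position works as long as the order of the links never changes. One possible alternative, sketched here on a made-up card that mirrors the footer shown earlier, is to select the link by its visible text with `string='Apply'`. This assumes the link text is exactly "Apply", with no surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Made-up card that mirrors the footer markup shown earlier.
html = """
<div class="card-content">
  <footer class="card-footer">
    <a href="https://www.realpython.com" target="_blank">Learn</a>
    <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
       target="_blank">Apply</a>
  </footer>
</div>
"""
job_element = BeautifulSoup(html, "html.parser").find("div", class_="card-content")

by_position = job_element.find_all("a")[1]["href"]       # second link
by_text = job_element.find("a", string="Apply")["href"]  # link labeled "Apply"

print(by_position == by_text)
```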
    
    

And there you have it! Awesome! We have managed to scrape valuable data from a bunch of job postings!

In [None]:
import pandas as pd

In [None]:
list_of_lists = []
for job_element in python_job_elements:
    title_element = job_element.find('h2', class_='title')
    company_element = job_element.find('h3', class_='company')
    location_element = job_element.find('p', class_='location')
    link_url = job_element.find_all('a')[1]['href']

    # Store the cleaned text of each element, not the element itself
    list_of_page = [
        title_element.text.strip(),
        company_element.text.strip(),
        location_element.text.strip(),
        link_url,
    ]
    list_of_lists.append(list_of_page)

# One row per job posting, with descriptive column names
df = pd.DataFrame(list_of_lists, columns=['title', 'company', 'location', 'apply_url'])

df.to_csv('jobs-list.csv', index=False, sep=';')

Well, that was a first attempt at storing the retrieved data as a `.csv` file; a more thorough treatment will be the subject of another notebook! Thank you again, Martin Breuss and [Real Python](https://realpython.com), for the amazing content that you provide every day!