# Beautiful Soup Lab

### Introduction

In this lesson, we'll write a scraper that looks through data engineering jobs on Indeed.com.  To do so, we'll combine the selenium library with the Beautiful Soup library.

### Getting Started

Let's begin by exploring the Indeed.com website.  What we're looking for is the url that we can make a request to and ultimately scrape.

Ok, so go to Indeed.com, and then see how it works by typing in the job title `Data Engineer`, and a location of `New York, NY`, then click on `Find Jobs`.

<img src="data-eng-jobs.png" width="100%">

Finally, click on the second page of results.

<img src="./second-results.png" width="40%">

The key thing to really pay attention to is the url at the top as we navigate the website.  As we can see we have a url of `indeed.com/jobs` with various parameters.  

The `start=10` is a pagination parameter, which allows us to page through results.  

> So here, we are not seeing the results at the very top, but from number 10 on, as we are on the second page and there are 10 results per page.

<img src="./indeed-url.png" width="80%">
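To make the url structure concrete, here is a minimal sketch of building a search url from a job title, a location, and a `start` offset.  The `start` parameter comes from the url above; the parameter names `q` and `l`, and the helper name `build_search_url`, are assumptions for illustration, so compare them against the url you actually see in your browser.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(title, location, start=0):
    """Return a search url for the given title, location and result offset."""
    # `q` and `l` are assumed parameter names; `start` is the pagination
    # offset, so start=10 asks for the second page of results.
    params = {"q": title, "l": location, "start": start}
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("Data Engineer", "New York, NY", start=10))
```

Using `urlencode` means spaces and commas in the title and location are escaped for us, rather than being pasted into the url by hand.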

Ok, so now it's time to write our first function.  Before doing so, first create a new python environment, and activate the environment.

Then install the necessary libraries for the project, which are listed in the `requirements.txt` file.

You can install these by running:

`pip3 install -r requirements.txt`

Then, you can run the tests for the `indeed_client` with the command:

```bash
python3 -m pytest tests/test_indeed_client.py
```

### Working with the Indeed client

Ok, so the first file we should work on is the `indeed_client.py` file.  By client, we mean something that interacts directly with the external website, `indeed.com`.

In that file, we wrote a function called `get_indeed_html` that uses selenium to make a request to indeed.com.  It should automatically install the chrome driver, which you can see more information about [here](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/).
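For orientation, a function like this might look roughly like the sketch below.  This is not necessarily the code in `indeed_client.py`: the `url` argument is an assumption, and the sketch relies on recent selenium versions (4.6 and later) locating or downloading a matching chrome driver on their own.

```python
def get_indeed_html(url):
    """Open `url` in Chrome and return the rendered page source as a string."""
    # Imported inside the function so this file can still be imported
    # before the project requirements are installed.
    from selenium import webdriver

    # Assumption: selenium 4.6+ finds (or downloads) the chrome driver itself.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The `try`/`finally` is there so that a failed page load does not leave a stray browser window open.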



* Parsing jobs

Now the function above returns the HTML from the entire page, so we would now like to write a function called `get_job_cards` that calls our `get_indeed_html` function, and then selects the list of 25 job cards on each page.  Notice that the relevant content appears to be located in the `td` items.

> We started the function for you, but complete it so that it returns a list of the tds.  Pass the related test in the `test_indeed_client.py` file.

<img src="./result-content.png" width="40%">
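As a hint for the selection step, here is a small sketch of pulling every `td` out of an HTML string with Beautiful Soup.  The helper name `parse_job_cards` and the sample markup are made up for illustration; in the lab, `get_job_cards` would first get its HTML from `get_indeed_html`.

```python
from bs4 import BeautifulSoup


def parse_job_cards(html):
    """Return a list of every td element found in the html string."""
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("td")


# Made-up markup standing in for a results page.
sample_html = """
<table>
  <tr><td><a data-jk="abc123">Data Engineer</a></td></tr>
  <tr><td><a data-jk="def456">Senior Data Engineer</a></td></tr>
</table>
"""

if __name__ == "__main__":
    cards = parse_job_cards(sample_html)
    print(len(cards))
```

On the real page there may be `td` elements that are not job cards, in which case passing a class to `find_all` (for example `class_="..."`) is one way to narrow the selection.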

### Indeed Adapter

Ok, so now that we've seen how to retrieve the list of cards with the client, the next step is to move on to the adapter file.  Remember that our client is what makes requests directly to the webpage.  

Well the adapter then takes that information and extracts the related information from it.  For our project, the adapter will take in our beautiful soup html object for a single data engineering position, and retrieve information about that position.

If you look at the `test_indeed_adapter.py` file, you can see that we pass in the html for a single card, create a beautiful soup object out of it, and then ask our adapter to extract the related information.

While the adapter extracts information for a single card, you can see that in the `index.py` file, we have a `run` function which loops through all of the cards using the adapter to extract information from each one.

Ok, let's get started.

In the `indeed_adapter.py` file, write code for the following functions:

* `get_id` returns the related id for a position
* `get_company_name` returns the company name of a position
* `get_title` returns the job title
* `get_salary_text` returns just the related salary text from a card (see the related test)
* `clean_salary` converts each listed salary in the salary_text to a float.
* `get_salaries` returns a list of both the minimum and maximum salary listed for a position
* `get_city_state` returns a tuple of the city and state for a position
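To illustrate the salary functions, here is one possible sketch.  It assumes the salary text looks something like `$120,000 - $150,000 a year`, and it simplifies both functions to take the text directly; the related tests define the signatures and formats the lab actually expects.

```python
import re


def clean_salary(salary_text):
    """Return every dollar amount found in salary_text as a float."""
    # Capture the digits (with commas and optional cents) after each "$".
    amounts = re.findall(r"\$([\d,]+(?:\.\d+)?)", salary_text)
    return [float(amount.replace(",", "")) for amount in amounts]


def get_salaries(salary_text):
    """Return [minimum, maximum] salary, or an empty list if none is listed."""
    salaries = clean_salary(salary_text)
    if not salaries:
        return []
    return [min(salaries), max(salaries)]


if __name__ == "__main__":
    print(get_salaries("$120,000 - $150,000 a year"))
```

Note that when only one salary is listed, this sketch returns that number as both the minimum and the maximum, which is a design choice rather than a requirement of the lab.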

> Hint: For `get_id` look to the `a tag` nested inside of an individual `td`, and on that a tag, you can find the `data-jk` attribute that has the id.

<img src="./data-jk-a.png" width="70%">
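Following the hint, a sketch of `get_id` could look like the code below.  The card markup is invented for the example, and returning `None` when no matching `a` tag exists is an assumption about how missing ids should be handled.

```python
from bs4 import BeautifulSoup


def get_id(card):
    """Return the data-jk value of the first a tag in the card that has one."""
    # attrs={"data-jk": True} matches any a tag that carries the attribute.
    link = card.find("a", attrs={"data-jk": True})
    return link["data-jk"] if link else None


# Made-up markup standing in for a single job card.
card_html = '<td><h2><a data-jk="a1b2c3" href="#">Data Engineer</a></h2></td>'

if __name__ == "__main__":
    card = BeautifulSoup(card_html, "html.parser").find("td")
    print(get_id(card))
```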

### Creating Models

Next, create a class called `Position`, located in the `models/position.py` file.  An instance of the class should be initialized with the `id`, `title`, `salaries`, `city`, `state`, and `company_name` of the related position.  But the instance should have attributes of `min_salary` and `max_salary` instead of a single `salaries` attribute (see the related test in `test_position.py`).  
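One way to read that description is sketched below.  How to handle a position with no listed salary is not specified above, so storing `None` in that case is an assumption; the test in `test_position.py` is the real reference.

```python
class Position:
    """A single job posting, with its salary range split into two attributes."""

    def __init__(self, id, title, salaries, city, state, company_name):
        self.id = id
        self.title = title
        self.city = city
        self.state = state
        self.company_name = company_name
        # Store the range as two attributes rather than keeping `salaries`.
        # Assumption: an empty list means no salary was listed.
        self.min_salary = min(salaries) if salaries else None
        self.max_salary = max(salaries) if salaries else None


if __name__ == "__main__":
    position = Position("a1b2c3", "Data Engineer", [120000.0, 150000.0],
                        "New York", "NY", "Acme")
    print(position.min_salary, position.max_salary)
```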

Finally we'll want to write a function called `run`, located in the IndeedAdapter.  The `run` function in the IndeedAdapter should call the previously written adapter functions to retrieve all of the information related to a position, and then create an instance of the Position class, to create a position.

You can see that our `save_to_json` function first calls `run` to get the list of positions and then converts each position to a dictionary to eventually write as json.  If everything works, we should be able to call the save_to_json function and see the resulting json in our data folder.  We should see something like the following:

<img src="./json-data.png" width="70%">
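The convert-then-write step can be sketched as follows.  The helper name `write_positions` and its arguments are hypothetical (the lab's `save_to_json` calls `run` itself), and `SimpleNamespace` stands in for a position instance so the example runs on its own.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def write_positions(positions, path):
    """Convert each position to a dictionary and write the list as json."""
    # vars() returns the instance attributes of an object as a dictionary.
    records = [vars(position) for position in positions]
    path = Path(path)
    # Create the target folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(records, f, indent=2)
    return records


if __name__ == "__main__":
    # Stand-in for a position instance, with made-up values.
    example = SimpleNamespace(id="a1b2c3", title="Data Engineer",
                              min_salary=120000.0, max_salary=150000.0)
    write_positions([example], "data/positions.json")
```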

### Summary

In this lesson, we used our knowledge of selenium, Beautiful Soup, and objects to build an Indeed scraper.  The pattern that we used is called the adapter pattern.  With that pattern, we used a *client* to interact directly with the website, and then passed the retrieved information to the adapter, which extracted the related information and created a position instance.