# Dockerizing our Scraper

### Introduction

In this lesson, we'll dockerize our scraper. This will help when we eventually deploy it, because a container gives us a more consistent environment to run in.

Let's get started.

### Moving to Playwright

If you look at the `indeed_client.py` file, you'll notice that we have moved away from Selenium to an alternative tool called `playwright`.

Our main reason for using playwright is that it is a bit easier to dockerize. But playwright also offers other advantages. For example, it provides an asynchronous API (Selenium's Python bindings do not), and it automatically waits for elements to appear before, say, clicking on them.

You can see the documentation for getting started with playwright [here](https://playwright.dev/python/docs/library).

Now let's see how we can use playwright to load our Indeed page.

```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync


def get_indeed_html():
    """Load the Indeed search results page and return its HTML."""
    with sync_playwright() as p:
        # Firefox launches in headless mode by default
        browser = p.firefox.launch()
        page = browser.new_page()
        # Apply stealth patches so the site is less likely to block us
        stealth_sync(page)
        url = 'https://www.indeed.com/jobs?q=data+engineer&l=New+York%2C+NY&vjk=de0e64293cefea2f'
        page.goto(url)
        content = page.content()
        browser.close()
        return content
```

There are a couple of other things to notice. We import and call a function named `stealth_sync`, which helps prevent websites from blocking our scraper. We use Firefox (instead of Chromium) for the same reason: it makes our scraper less likely to be blocked.

The rest of our code works the same. We get the HTML with `page.content()`, return that HTML, and then pass it to Beautiful Soup.

### Trying it out

Ok, so now let's make sure our code still works now that we have migrated to playwright.

> Note: By default, playwright runs in headless mode (meaning we will not see the browser boot up). However, we can change this by passing an argument into `firefox.launch`, like so: `firefox.launch(headless=False)`. Let's keep it headless for now.
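
If you'd like to switch between headless and headed runs without editing the code, one option is to read the setting from an environment variable. The following is only a sketch: `launch_options` and the `SCRAPER_HEADLESS` variable are hypothetical names made up for this example, and are not part of our project or of playwright. The only playwright piece is the `headless` keyword argument described in the note above.

```python
import os


def launch_options(env=None):
    """Build keyword arguments for `firefox.launch` from the environment.

    SCRAPER_HEADLESS is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("SCRAPER_HEADLESS", "true").strip().lower()
    # Only an explicit "off" value disables headless mode;
    # anything else keeps playwright's headless default.
    return {"headless": value not in ("0", "false", "no")}


# Intended use inside get_indeed_html (not executed here):
#     browser = p.firefox.launch(**launch_options())
print(launch_options({"SCRAPER_HEADLESS": "false"}))
```

Keeping headless as the default is a deliberate choice here, since a container typically has no display to show a browser window on.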

Ok, so to get started with our code, we can `cd` into the `scraper` folder, install the dependencies listed in requirements.txt with `pip3 install -r requirements.txt`, and then run our scraper with the following:

```
python3 manage.py save_as_json
```

If you look at the manage.py file, you can see that this command runs our `save_to_json` function from `index.py`, which will scrape the Indeed page and store the results in the `data` folder.

After running the `python3 manage.py save_as_json` line, you should see some JSON in a `data.json` file. Ok, now delete that `data.json`. We're about to dockerize our scraper.
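
Before deleting the file, it can be worth confirming that it really contains valid JSON rather than an empty or partial result. Here is a minimal sketch using only the standard library. The `summarize_json` helper is a hypothetical name, the `data/data.json` path is an assumption based on the folder mentioned above, and the sketch makes no assumption about what fields the scraper writes.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level shape."""
    # json.loads raises json.JSONDecodeError if the file is not valid JSON.
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # A list or dict has a meaningful length; other JSON values count as 1.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    # Assumed location of the scraper's output; adjust if yours differs.
    target = Path("data/data.json")
    if target.exists():
        print(summarize_json(target))
    else:
        print(f"{target} not found; run the scraper first")
```

The same check can be reused later to confirm that a file produced inside a container is readable from your local directory.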

### Dockerizing our Code

Ok, now it's time to turn this into a Docker image. If you look at our Dockerfile, you'll see that we've gotten you started.

```
FROM mcr.microsoft.com/playwright/python:v1.32.0-jammy

WORKDIR /home/pwuser

# Fill in code here

CMD ["python3", "manage.py", "save_as_json"]
```

The base image already has playwright installed on it, as well as Python. We then change the working directory to `/home/pwuser`, which is where we'd like to place our code.

We end by running the command that will scrape our page and save our data to the `./data/data.json` file.

Ok, now it's time for you to get going. Dockerize our code so that we can successfully run the command.

To make sure your code is set up correctly, it probably makes sense to open a bash shell in a container and try running `python3 manage.py save_as_json` from there.

### Getting the data from the container

Ok, now that we've dockerized our code, we would still like to get the resulting data out of the container and onto our personal computer. To accomplish this, use a bind mount so that the data written inside the container also appears in your local directory.

> If done properly, you should see the resulting data.json file in your local directory.

<img src="./resulting-data.png" width="100%">