# Web Scraping with Python and Selenium - Demo for GIS Day 2021
A brief demo of scraping web data using Selenium

## Why scrape?

 - You want to analyze the data on a website but there is no way to export it
 - You want to work with web data inside your program but there's no API to retrieve it automatically
 - You want to keep regular screenshots and backups of a page and analyze the differences over time

## Selenium

 - Low learning curve with capability of complex use cases
 - Works within a browser window, scraping can be performed on most websites
    - Page elements can be clicked and fields can be filled
 - Free and open source with lots of documentation


### Other scraping library options

- Python [requests](https://docs.python-requests.org/en/latest/) and [BeautifulSoup  ](https://www.crummy.com/software/BeautifulSoup/)
   - For simple scraping of plain HTML pages
   - Requests library is widely used for API querying
   - BeautifulSoup's advanced HTML object selection can be used with Selenium
- [Scrapy](https://scrapy.org)
   - For advanced large-scale scraping projects
   - Has its own shell for testing scraping approach
   - Capable of advanced countermeasures against scraping detection

# Scraping demo

Some minimum requirements:

 - `selenium` Python library
 - Selenium webdriver executable
    - [Instructions and downloads](https://www.selenium.dev/documentation/getting_started/installing_browser_drivers/)

## Imports and settings

In [1]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.wait import WebDriverWait 
from xvfbwrapper import Xvfb
import time
import requests
import json
import base64
import os
import subprocess
import pandas

In [2]:
pageurl = 'https://www.gjc.org/cgi-bin/listjobs.pl?view=table'
headless = False

fp = webdriver.FirefoxProfile()
options = Options()
options.headless = headless

## Getting started - find page elements to scrape

### Open the driver

In [3]:
driver = webdriver.Firefox(fp,options=options)

### Open the page

In [4]:
driver.get(pageurl)

### Find the HTML element that contains what we want
In the inspector, we find that the html element is the `table` with the XHTML ID attribute `jobs_table`

In [8]:
table = driver.find_element_by_id('jobs_table')
table_html = table.get_attribute('outerHTML')

### Read the table with pandas

In [89]:
df = pandas.read_html(table_html)
# read_html creates a list of DFs, we need the first one
df = df[0]

df

Unnamed: 0,Date,Title,Organization,Location
0,2021-09-20,GIS Consultant,G2 Integrated Solutions,"San Ramon, Ca"
1,2021-09-20,Party Chief - Survey,"GISinc, A Continental Mapping Company","Sun Prairie, WI or work from home considered"
2,2021-09-17,GIS Technician 1,Planet Forward,"Plainfield, IN"
3,2021-09-17,GIS Technician – Part-Time,Theorem Geo,"Charlotte, NC"
4,2021-09-17,GISc Director,Trout Unlimited,"Boise, ID USA or remote"
5,2021-09-17,Geospatial Developer,Weyerhaeuser,"Seattle, WA; Springfield, OR; Columbia Falls, ..."
6,2021-09-17,GIS Tools Team Lead,Weyerhaeuser,"Seattle, WA; Springfield, OR; Columbia Falls, ..."
7,2021-09-17,GIS Intern (Paid - 2 openings),"Ducks Unlimited, Inc.","Rancho Cordova (Sacramento area), CA USA"
8,2021-09-16,Sr. GIS Data Architect / Project Manager,"Innovate!, Inc.",Salt Lake City Utah
9,2021-09-16,GIS Developer,"Innovate!, Inc.",Remote


## Let's scrape everything

In [55]:
# First, close out of the driver we opened earlier
driver.close()

In [80]:
# Reopen a new session
driver = webdriver.Firefox(fp,options=options)
driver.get(pageurl)

In [81]:
# And create a new Pandas dataframe that everything will go into
jobs_df = pandas.DataFrame(columns=['Date', 'Title', 'Organization', 'Location'])

### Get page numbers
We need to know when to stop hitting the "Next" button.

In [58]:
pages_ribbon = [a.text for a in driver.find_elements_by_class_name('paginate_button')]
pages_ribbon

['Previous', '1', '2', '3', '4', '5', '12', 'Next']

In [59]:
total_pages = int(pages_ribbon[-2])
total_pages

12

### Loop through the pages and save them to the dataframe

In [82]:
for i in range(total_pages):
    if i+1 != total_pages:
        
        table_html = driver.find_element_by_id('jobs_table').get_attribute('outerHTML')
        page_table = pandas.read_html(table_html)[0]
        jobs_df = pandas.concat([jobs_df, page_table]) # pandas.concat expects an interable
        
        next_button = driver.find_element_by_id('jobs_table_next')
        next_button.click()
        
        # Delay a little bit so we can see it happen
        time.sleep(.5)
        
    else:
        print('Done!')

done!


## Let's take a look at our work

In [90]:
jobs_df

Unnamed: 0,Date,Title,Organization,Location
0,2021-11-10,Geospatial Data Manager,NatureServe,"Arlington, VA"
1,2021-11-09,GIS Technician,Snohomish County Public Works-ES,"Everett, WA"
2,2021-11-09,GIS Technician I,Sedgwick County,"Wichita, KS"
3,2021-11-08,GIS Analyst,"CEMML, Colorado State University","Fort Wainwright, Alaska"
4,2021-11-08,Geospatial Data Specialist I or II (Sewer),City of Cedar Rapids,"Cedar Rapids, IA"
...,...,...,...,...
5,2021-09-17,Geospatial Developer,Weyerhaeuser,"Seattle, WA; Springfield, OR; Columbia Falls, ..."
6,2021-09-17,GIS Tools Team Lead,Weyerhaeuser,"Seattle, WA; Springfield, OR; Columbia Falls, ..."
7,2021-09-17,GIS Intern (Paid - 2 openings),"Ducks Unlimited, Inc.","Rancho Cordova (Sacramento area), CA USA"
8,2021-09-16,Sr. GIS Data Architect / Project Manager,"Innovate!, Inc.",Salt Lake City Utah


## Save the data to a CSV

In [87]:
jobs_df.to_csv('out.csv')

# Thanks for participating!

Get the repo here: https://github.com/karltach/GISDayScrapingDemo