# Scraping SDWIS: Part 1

I'm focusing on the "Water System Details" (WSD) endpoint, at the URL <https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp>.

## Preliminary information

- JSP stands for [JavaServer Pages](https://en.wikipedia.org/wiki/JavaServer_Pages)
    - JSP is a server-side technology that queries a database and renders complete HTML documents on the server before delivering them to the client (the user's browser)
    - This means that there are no client-side requests to a backend data API that we could analyze and intercept to obtain the "raw" data; instead, we have to scrape the rendered HTML documents

## Getting the URLs

The first step to scrape the dataset is to obtain a list of URLs corresponding to each WSD page.  

An example URL is <https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=13798&tinwsys_st_code=CA>.

The primary URL parameter is `tinwsys_is_number`, which seems to be an internal database ID/primary key and, unfortunately, does not appear to be related (at least, not obviously) to the PWSID. `tinwsys_is_number` looks like an integer, so one possible choice at this point would be to assume that it's sequential, generate a range of integers to try, and query the server with the generated URLs, parsing the pages that we happen to guess correctly and ignoring the rest.
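For comparison, the guessing approach could be sketched roughly as follows. This is only an illustration, assuming the IDs really are sequential integers; `build_wsd_url` and `candidate_urls` are hypothetical helper names, the range bounds are arbitrary, and no requests are actually sent.

```python
from urllib.parse import urlencode

WSD_BASE_URL = "https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp"


def build_wsd_url(is_number, st_code="CA"):
    """Build the WSD URL for a guessed `tinwsys_is_number` (hypothetical helper)."""
    query = urlencode({"tinwsys_is_number": is_number, "tinwsys_st_code": st_code})
    return f"{WSD_BASE_URL}?{query}"


def candidate_urls(start, stop):
    """Yield one candidate URL per integer in [start, stop)."""
    for is_number in range(start, stop):
        yield build_wsd_url(is_number)


# The bounds below are arbitrary and chosen only for illustration.
candidates = list(candidate_urls(1, 6))
print(candidates[0])
```

Every candidate would then need its own request, and most guesses would likely miss, which is why this route is unattractive.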

Fortunately, there's a better way, thanks to the Water System Search endpoint at <https://sdwis.waterboards.ca.gov/PDWW/index.jsp>. By submitting a blank search form ([search URL](https://sdwis.waterboards.ca.gov/PDWW/JSP/SearchDispatch?number=&name=&county=&WaterSystemType=All&WaterSystemStatus=A&SourceWaterType=All&action=Search+For+Water+Systems)), we obtain a list of all 8329 PWSs in the form of a jQuery table; crucially, each entry in the "Water System No." column is a hyperlink to the corresponding WSD page.
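To make the structure of the search URL explicit, here is a small sketch that rebuilds it from its query parameters. The parameter names and values are copied from the search URL above; `build_search_url` is a hypothetical helper, and how the server handles values other than these is not something I have verified.

```python
from urllib.parse import urlencode

SEARCH_BASE_URL = "https://sdwis.waterboards.ca.gov/PDWW/JSP/SearchDispatch"

# Parameters as they appear in the blank-search URL; the first three
# fields are left empty, exactly as in the submitted blank form.
SEARCH_PARAMS = {
    "number": "",
    "name": "",
    "county": "",
    "WaterSystemType": "All",
    "WaterSystemStatus": "A",
    "SourceWaterType": "All",
    "action": "Search For Water Systems",
}


def build_search_url(**overrides):
    """Return the search URL, optionally overriding some parameters (hypothetical helper)."""
    params = {**SEARCH_PARAMS, **overrides}
    return f"{SEARCH_BASE_URL}?{urlencode(params)}"


print(build_search_url())
```

`urlencode` encodes the spaces in the `action` value as `+`, so the result matches the blank-search URL character for character.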

It's in principle possible to extract the information from the table programmatically; however, since this is a one-time operation, I opted to:
- Manually select "Display [All] records"; this will render the full table as HTML in the browser
- Download the resulting page locally as an HTML text file
- Parse the HTML to extract the links

For this operation, as well as for parsing the WSD pages, I'll use the `requests_html` library.

In [1]:
from requests_html import HTML

import pandas as pd

from config import PATH_HTML_URLS

In [2]:
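def get_urls_from_html_table(html_text, url_pattern='WaterSystemDetail.jsp'):
    """Return all links in `html_text` whose URL contains `url_pattern`."""
    doc = HTML(html=html_text)

    # doc.links is a set, so the order of the returned list is not guaranteed
    return [url for url in doc.links if url_pattern in url]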

In [3]:
URLS = get_urls_from_html_table(PATH_HTML_URLS.read_text())
len(URLS)

8328

In [4]:
URLS[:10]

['https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=4168&tinwsys_st_code=CA&wsnumber=CA4210022',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=5591&tinwsys_st_code=CA&wsnumber=CA5090901',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=13399&tinwsys_st_code=CA&wsnumber=CA0900615',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=14601&tinwsys_st_code=CA&wsnumber=CA1100101',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=13471&tinwsys_st_code=CA&wsnumber=CA2701474',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=4426&tinwsys_st_code=CA&wsnumber=CA4510007',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=2230&tinwsys_st_code=CA&wsnumber=CA1900146',
 'https://sdwis.waterboards.ca.gov/PDWW/JSP/WaterSystemDetail.jsp?tinwsys_is_number=189&tinwsy

Let's save this information in a table so that it's easily available.

To do so, it's useful to extract the values of the query parameters from the URL.

In [5]:
from urllib.parse import urlparse, parse_qs

In [6]:
def get_params_from_url(url):
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    # each value of params is a 1-element list, so we take the first element
    return {key: val[0] for key, val in params.items()}

In [7]:
df_urls = (
    pd.DataFrame([dict(url=url, **get_params_from_url(url)) for url in URLS])
    .rename(columns={'tinwsys_is_number': 'pws_url_id', 'wsnumber': 'pws_id'})
    [['pws_id', 'pws_url_id', 'url']]
    .set_index('pws_id')
    .sort_index()
)
df_urls.head(20)

| pws_id | pws_url_id | url |
|---|---|---|
| CA0103040 | 3 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0103041 | 4 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105002 | 15 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105003 | 16 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105008 | 18 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105009 | 19 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105010 | 20 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105012 | 22 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105013 | 23 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |
| CA0105016 | 26 | https://sdwis.waterboards.ca.gov/PDWW/JSP/Wate... |


In [8]:
from config import PATH_URLS
PATH_URLS

PosixPath('data-out/wsd-urls.csv')

In [9]:
df_urls.to_csv(PATH_URLS)
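
To use the saved table later, it can be read back in several ways. The sketch below uses only the standard library and assumes the CSV header is `pws_id,pws_url_id,url`, which is what `to_csv` should write for the DataFrame above; `load_url_table` is a hypothetical helper, not part of the original notebook.

```python
import csv
from pathlib import Path


def load_url_table(path):
    """Return a dict mapping pws_id -> (pws_url_id, url) read from the saved CSV."""
    with Path(path).open(newline="") as f:
        return {
            row["pws_id"]: (row["pws_url_id"], row["url"])
            for row in csv.DictReader(f)
        }
```

For example, `load_url_table(PATH_URLS)` would return one entry per water system. All values come back as strings, so `pws_url_id` is not converted to an integer.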