# PASDA Harvest

[Pennsylvania Spatial Data Access (PASDA)](https://www.pasda.psu.edu) is the state’s comprehensive GIS clearinghouse. Most Pennsylvania statewide agencies and regional organizations provide their data through this site. Many counties and cities do as well.


Part 1: Get the landing page URLs
1. Use the script `getURLs.py` to obtain a list of all of the records currently in PASDA. The resulting CSV, `URLs_{today's date}.csv`, is a list of the landing pages for the datasets in the PASDA portal.

Part 2: Parse the discovery metadata
1. Use `URLs_{today's date}.csv` and the `html2csv.py` script to scrape the metadata from the PASDA landing pages. The resulting CSV will be called `output_{today's date}.csv`.

Part 3: Extract the bounding boxes
Context: most of the records have supplemental metadata in ISO 19139 or FGDC format. The link to this document is the 'Metadata' link on each landing page.
Although these files are created as XML, the link points to a rendered HTML page.

1. Create a CSV file that lists just the metadata file pages. See `sample-inputMetadataUrls.csv` for an example.
2. Use the `getBbox.py` script to parse the files and extract the bounding boxes.
3. Merge these values back into the output CSV from Part 2.



## Part 1: Get URLs

Purpose: This script queries the PASDA search page and returns the landing page URL of every published dataset.

### Import the modules.

In [1]:
import csv # The csv module implements classes to read and write tabular data in CSV format.
import time # This module provides various time-related functions.
import urllib.request # The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP)
from bs4 import BeautifulSoup # For pulling data out of HTML and XML files
import re
from urllib.request import urlopen

### Search for the ID of all of the published items.

PASDA's website does not have a page that lists all of the published datasets. Therefore, we will perform an empty search, passing `+` as the keyword value (a `+` in a query string decodes to a space), to return all results.

In [None]:
resURL = 'https://www.pasda.psu.edu/uci/SearchResults.aspx?Keyword=+' # Setting URL
page = urllib.request.urlopen(resURL).read() # Reading content of webpage
soup = BeautifulSoup(page, 'html.parser') # Parse the page

In [None]:
# The results table holds one <a> tag per dataset; collect them all
table = soup.find('table', id="DataGrid1")
hrefs = table.find_all('a') # find_all is the current name for the older findAll

In [None]:
urls = [] # One single-item row per landing page URL
for href in hrefs:
    url = 'https://www.pasda.psu.edu/uci/' + href['href'] # Prepend the base URL to the relative href
    urls.append([url]) # Wrap in a list so csv.writer writes one URL per row
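
String concatenation works as long as every `href` in the results table is relative to `/uci/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative and absolute links alike. The sketch below is not part of the original script: the function name `build_urls` is hypothetical, and the sample `href` values are made up for illustration rather than taken from the live site.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.pasda.psu.edu/uci/'

def build_urls(hrefs, base=BASE_URL):
    """Resolve each href against the base URL, dropping duplicates but keeping order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            urls.append([url])  # Single-item rows, ready for csv.writer.writerows
    return urls

# Hypothetical href values, for illustration only
sample_hrefs = [
    'DataSummary.aspx?dataset=1',
    'DataSummary.aspx?dataset=1',  # Duplicate, dropped
    'https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=2',  # Already absolute
]
rows = build_urls(sample_hrefs)
```

In the notebook, the input would be `[a['href'] for a in hrefs]`. Whether the results table ever repeats a link is not something this document states, so the de-duplication is only a precaution.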

### Write a CSV file that contains a list of the landing page URLs for each item.

In [None]:
# Write all dataset URLs to a CSV file named with the action date
actionDate = time.strftime('%Y%m%d') # Action date in YYYYMMDD format
with open(f'URLs_{actionDate}.csv', 'w', newline='') as fw: # newline='' prevents blank rows on Windows
    writer = csv.writer(fw)
    writer.writerows(urls) # One URL per row

## Part 2: Parse the discovery metadata on each page

Next, this script will scan each page listed in the `URLs_{actionDate}.csv` file to look for the following information:

- Title
- Date
- Publisher
- Description
- Metadata Link (a linked supplemental file - more details on this in Part 3)

The script will then write a new CSV file in the GeoBTAA template format that contains all of this parsed information as well as default values for other fields.

In [2]:
code = '08a-01'  # Portal Hub Header
accessRights = "Public" # Access Rights Header
accrualMethod = "HTML" # Accrual Method Header
dateAccessioned = time.strftime('%Y-%m-%d') # Data Accessioned Header
language = "eng" # Language Header
isPartOf = "08a-01" # Part of Portal Header
memberOf = "ba5cc745-21c5-4ae9-954b-72dd8db6815a" # Member of Link Header
provider = "Pennsylvania Spatial Data Access (PASDA)" # Provider Agency Header
resourceClass = "" # Resource Class Header
resourceType = '' # Resource Type Header
dateRange = '' # Date Range Header

In [7]:
# Extract existing URLs from the local CSV file written in Part 1

urls = [] # Container for the landing page URLs

# Read the rows from the Part 1 output CSV (update the date in the file name to match your run)

with open('URLs_20230218.csv') as fr: # Open the CSV file
    reader = csv.reader(fr)  # Create a reader over its rows
    for row in reader: # Each row is a one-item list holding a URL
        urls.append(row)


# Store parsed elements for all URLs

parseElements = [] # One row of parsed values per landing page


# Loop through the list of URLs
for url in urls:
    landingPage = url[0] # Each row is a one-item list
    iden = 'pasda-' + landingPage.rsplit("=",1)[1] # ID is the value after the last '='

    # Open the URL and read the page content
    page = urllib.request.urlopen(landingPage).read()
    soup = BeautifulSoup(page, "html.parser")

    # Find the values for the title, date, publisher, and description fields
    title = soup.find(attrs={'id': 'Label1'}).text.strip()
    date = soup.find(attrs={'id': 'Label2'}).text.strip()
    publisher = soup.find(attrs={'id': 'Label3'}).text.strip()
    description = soup.find(attrs={'id': 'Label14'}).text.strip()

    # Find the links for the metadata files and dataset downloads
    metadataLink = soup.find('a', href=True, string='Metadata')
    downloadLink = soup.find('a', href=True, string='Download')

    # soup.find returns None when a link is absent; leave that field empty
    metadata = ("https://www.pasda.psu.edu/uci/" + metadataLink['href']) if metadataLink else ''
    download = downloadLink['href'] if downloadLink else ''

    parseElements.append([landingPage,iden,title,date,dateRange,publisher,provider,language,description,resourceClass,resourceType,metadata,download,code,isPartOf,memberOf,accessRights,accrualMethod,dateAccessioned])
    
    
# Generate action date with format YYYYMMDD

actionDate = time.strftime('%Y%m%d')

# Write outputs to local CSV file

with open(f'output_{actionDate}.csv', 'w', newline='') as fw: # newline='' prevents blank rows on Windows
    fields = ['Information','ID','Title','Temporal Coverage','Date Range','Publisher','Provider','Language','Description','Resource Class','Resource Type','HTML','Download','Code','Is Part Of','Member Of','Access Rights','Accrual Method','Date Accessioned']

    writer = csv.writer(fw)
    writer.writerow(fields)           # Header row
    writer.writerows(parseElements)   # One row per dataset

    print('#### Job done ####')

#### Job done ####
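
The identifier in the cell above is built from whatever follows the last `=` in the landing page URL. A slightly more defensive sketch reads the query string with `urllib.parse` instead. It assumes the dataset number is carried in a query parameter named `dataset`, which is an assumption about the URL pattern rather than something this notebook states; the function name and the sample URL are hypothetical as well.

```python
from urllib.parse import urlparse, parse_qs

def make_identifier(landing_page, param='dataset', prefix='pasda-'):
    """Build an ID such as 'pasda-1234' from a landing page URL."""
    query = parse_qs(urlparse(landing_page).query)
    values = query.get(param)
    if values:
        return prefix + values[0]
    # Fall back to the notebook's approach: the text after the last '='
    if '=' in landing_page:
        return prefix + landing_page.rsplit('=', 1)[1]
    raise ValueError(f'No dataset number found in {landing_page!r}')

# Hypothetical URL, for illustration only
example = 'https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=1234'
identifier = make_identifier(example)
print(identifier)  # pasda-1234
```

Raising an error for a URL with no `=` makes a malformed row visible immediately, instead of surfacing as an `IndexError` partway through the loop.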


## Part 3: Extract the bounding boxes

Context: The bounding box coordinates are not part of the discovery metadata on each item's landing page. However, most of the records have supplemental metadata in ISO 19139 or FGDC format. The link to this document is the 'Metadata' link on each landing page, which Part 2 writes to the 'HTML' column of the output CSV. Although these files are created as XML, the link points to a rendered HTML page, so we can parse it using a method similar to the one shown in Part 2.

1. Create a CSV file that lists just the metadata file pages. See `sample-inputMetadataUrls.csv` for an example.
2. Use the `getBbox.py` script to parse the files and extract the bounding boxes.
3. Merge these values back into the output CSV from Part 2.
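
Step 1 can be scripted rather than done by hand. The sketch below copies the metadata links out of a Part 2 output file into a one-column CSV. It assumes the links sit in the `HTML` column (where the Part 2 cell writes them) and that the target file is named `metaLinks.csv`, the name the next cell reads. The function name is hypothetical.

```python
import csv

def write_metadata_links(output_csv, links_csv, column='HTML'):
    """Copy the non-empty values of one column into a one-column CSV; return the count."""
    with open(output_csv, newline='') as fr:
        links = [
            row[column].strip()
            for row in csv.DictReader(fr)
            if (row.get(column) or '').strip()  # Skip records without a metadata link
        ]
    with open(links_csv, 'w', newline='') as fw:
        csv.writer(fw).writerows([link] for link in links)
    return len(links)
```

A call would look like `write_metadata_links('output_20230218.csv', 'metaLinks.csv')`, where the input file name is only an example and should match the date of your own Part 2 run.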

In [None]:
# Read the list of metadata page URLs (one per row)
with open('metaLinks.csv', 'r') as harvest:
    portalMetadata = [row for row in csv.reader(harvest) if row] # Skip blank rows

labels = ['West', 'South', 'East', 'North'] # Order of the values in the output bounding box

# Create "bbox-output.csv" with the headers 'url' and 'bbox'
with open('bbox-output.csv', 'w', newline='') as out:
    f = csv.writer(out)
    f.writerow(['url', 'bbox'])

    for url in portalMetadata:
        pageLink = url[0] # Set before the request so a failed page is still logged under its own URL
        try:
            page = urlopen(pageLink).read()
            soup = BeautifulSoup(page, "html.parser")

            coords = []
            for label in labels:
                # Each coordinate is the text that follows an <i> tag such as 'West_Bounding_Coordinate:'
                try:
                    coords.append(soup.find('i', string=f'{label}_Bounding_Coordinate:').next_sibling.strip())
                except (AttributeError, TypeError):
                    coords.append('') # Label not found, or not followed by text
            bbox = ','.join(coords)
        except Exception:
            bbox = "missing" # The page could not be opened or parsed

        f.writerow([pageLink, bbox])
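
Step 3, merging the bounding boxes back into the Part 2 output, has no cell in this notebook. One way to do it with the standard library is sketched below. It assumes the join key is the metadata URL (the `HTML` column of the output CSV and the `url` column of `bbox-output.csv`) and that the result goes into a new column named `Bounding Box`. The function name, the new column name, and the choice to write a separate merged file are all assumptions made for illustration.

```python
import csv

def merge_bboxes(output_csv, bbox_csv, merged_csv, key='HTML', new_column='Bounding Box'):
    """Add a bounding box column to the output CSV by matching on the metadata URL."""
    with open(bbox_csv, newline='') as fb:
        bboxes = {row['url']: row['bbox'] for row in csv.DictReader(fb)}

    with open(output_csv, newline='') as fr, open(merged_csv, 'w', newline='') as fw:
        reader = csv.DictReader(fr)
        writer = csv.DictWriter(fw, fieldnames=list(reader.fieldnames) + [new_column])
        writer.writeheader()
        for row in reader:
            bbox = bboxes.get(row[key], '')
            # Keep only complete boxes; 'missing' and partial values such as ',,,' become empty
            parts = bbox.split(',')
            complete = len(parts) == 4 and all(part.strip() for part in parts)
            row[new_column] = bbox if complete else ''
            writer.writerow(row)
```

Writing to a separate file keeps the Part 2 output untouched, so the merge can be rerun after any failed metadata pages are retried.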