# PASDA Harvest

[Pennsylvania Spatial Data Access (PASDA)](https://www.pasda.psu.edu) is the state’s comprehensive GIS clearinghouse. Most Pennsylvania statewide agencies and regional organizations provide their data through this site. Many counties and cities do as well.


Part 1: Parse the PASDA portal
1. Use the script `getURLs.py` to obtain a list of all of the records currently in PASDA. The resulting CSV, `URLS_{today's date}.csv`, is a list of the landing pages for the datasets in the PASDA portal.


2. Use the CSV from step 1 and the `html2csv.py` script to scrape the metadata from the PASDA landing pages. The resulting CSV will be called `output_{today's date}.csv`.


Part 2: Extract the bounding boxes
Context: Most of the records have supplemental metadata in ISO 19139 or FGDC format. The link to this document is found in the 'Metadata' column.
Although these files are created as XML, the link points to a rendered HTML page.

1. Create a CSV file that lists just the metadata file pages. See `sample-inputMetadataUrls.csv` for an example.
2. Use the `getBbox.py` script to parse the files and extract the bounding boxes.
3. Merge these values back into the output CSV from Part 1.



## Part 1: Get URLs

Purpose: This script crawls the PASDA website and returns all of the published dataset landing pages.

### Import the modules.

In [None]:
import csv # The csv module implements classes to read and write tabular data in CSV format.
import time # This module provides various time-related functions.
import urllib.request # The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP)
from bs4 import BeautifulSoup # For pulling data out of HTML and XML files
import re
from urllib.request import urlopen


### Search for the ID of all of the published items.

PASDA's website layout does not have a page that lists all of the published datasets. Therefore, we will perform an empty search using the keyword value '+' to find all results.

In [None]:
# Start with the main search page
resURL = 'https://www.pasda.psu.edu/uci/SearchResults.aspx?Keyword=+'
page = urllib.request.urlopen(resURL).read()
soup = BeautifulSoup(page, 'html.parser')

# Identify landing page URLs inside <h3> tags
landing_page_links = soup.select('h3 a[href^="DataSummary.aspx?dataset="]')
landing_pages = ['https://www.pasda.psu.edu/uci/' + link['href'] for link in landing_page_links]

# For testing: keep only the first 5 landing pages.
# Comment out the next line to harvest every result.
landing_pages = landing_pages[:5]
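
A note on the `Keyword=+` query used above: in a URL query string, `+` is the form encoding of a space, so the search is effectively submitted with a blank keyword. That is presumably why it returns every record, although how PASDA's server treats the value is an assumption here. The sketch below uses only the standard library to show the encoding; `build_search_url` is a hypothetical helper and not part of the harvest scripts.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_PAGE = 'https://www.pasda.psu.edu/uci/SearchResults.aspx'

def build_search_url(keyword=' '):
    """Return the search URL for a keyword (hypothetical helper)."""
    # urlencode() form-encodes the value, turning a space into '+'
    return SEARCH_PAGE + '?' + urlencode({'Keyword': keyword})

url = build_search_url()
# Decoding the query string recovers the single-space keyword
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Building the URL this way also makes it straightforward to try a narrower keyword while testing.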

### Write a CSV file that contains a list of the landing page URLs for each item.

In [None]:
# Save the full list of landing pages to a CSV (uncomment the lines below to write the file)
# actionDate = time.strftime('%Y%m%d')
# with open(f'LandingPages_{actionDate}.csv', 'w', newline='') as fw:
#     writer = csv.writer(fw)
#     writer.writerow(["Landing Page URL"])  # Write header
#     for url in landing_pages:
#         writer.writerow([url])
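
The cell above is commented out, so no file is written by default. If the list does need to be saved and reloaded between sessions, a round trip could look like the following sketch. The function names and the sample URLs are hypothetical; only the `Landing Page URL` header comes from the cell above.

```python
import csv
import os
import tempfile

def write_landing_pages(urls, path):
    """Write one landing page URL per row under a header row."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as fw:
        writer = csv.writer(fw)
        writer.writerow(['Landing Page URL'])
        writer.writerows([url] for url in urls)

def read_landing_pages(path):
    """Read the URLs back, skipping the header row."""
    with open(path, newline='', encoding='utf-8') as fr:
        reader = csv.reader(fr)
        next(reader, None)  # Skip the header
        return [row[0] for row in reader if row]

# Round trip with placeholder URLs (not real dataset IDs)
sample = ['https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=1',
          'https://www.pasda.psu.edu/uci/DataSummary.aspx?dataset=2']
path = os.path.join(tempfile.mkdtemp(), 'LandingPages_sample.csv')
write_landing_pages(sample, path)
restored = read_landing_pages(path)
```

Reloading from the file avoids repeating the search when only the later parsing steps need to be rerun.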

## Part 2: Parse the discovery metadata on each page

Next, this script will scan each landing page collected in Part 1 (the `landing_pages` list) to look for the following information:

- Title
- Date
- Publisher
- Description
- Metadata Link (a linked supplemental file - more details on this in Part 3)

The script will then write a new CSV file in the GeoBTAA template format that contains all of this parsed information as well as default values for other fields.

In [None]:
code = '08a-01'  # Portal Hub Header
accessRights = "Public" # Access Rights Header
accrualMethod = "HTML" # Accrual Method Header
dateAccessioned = time.strftime('%Y-%m-%d') # Date Accessioned Header
language = "eng" # Language Header
isPartOf = "08a-01" # Part of Portal Header
memberOf = "ba5cc745-21c5-4ae9-954b-72dd8db6815a" # Member of Link Header
provider = "Pennsylvania Spatial Data Access (PASDA)" # Provider Agency Header
resourceClass = "" # Resource Class Header
resourceType = '' # Resource Type Header
dateRange = '' # Date Range Header

In [None]:
parseElements = []

# Loop through the landing pages
for url in landing_pages:
    page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    
    landingPage = url
    iden = 'pasda-' + landingPage.rsplit("=",1)[1]

    # Extract metadata fields
    title = soup.find(attrs={'id': 'Label1'}).text.strip()
    date = soup.find(attrs={'id': 'Label2'}).text.strip()
    publisher = soup.find(attrs={'id': 'Label3'}).text.strip()
    description = soup.find(attrs={'id': 'Label14'}).text.strip()
    
    metadataLink = soup.find('a', href=True, string='Metadata')
    downloadLink = soup.find('a', href=True, string='Download')
    
    # Either link may be absent; soup.find() returns None in that case
    metadata = "https://www.pasda.psu.edu/uci/" + metadataLink['href'] if metadataLink else ''
    download = downloadLink['href'] if downloadLink else ''
            
    parseElements.append([landingPage,iden,title,date,dateRange,publisher,provider,language,description,resourceClass,resourceType,metadata,download,code,isPartOf,memberOf,accessRights,accrualMethod,dateAccessioned])
    
# Generate action date with format YYYYMMDD
actionDate = time.strftime('%Y%m%d')

# Write outputs to a CSV file
with open(f'output_{actionDate}.csv', 'w', newline='', encoding='utf-8') as fw:
    fields = ['Information','ID','Title','Temporal Coverage','Date Range','Publisher','Provider','Language','Description','Resource Class','Resource Type','HTML','Download','Code','Is Part Of','Member Of','Access Rights','Accrual Method','Date Accessioned']

    writer = csv.writer(fw)
    writer.writerow(fields)
    writer.writerows(parseElements)

    print('#### Job done ####')
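
The identifier in the loop above is built by splitting the landing page URL on its last `=`. A slightly more defensive variant, sketched here as an assumption rather than the notebook's own method, reads the `dataset` query parameter explicitly so that any extra parameters would not end up in the ID. `make_identifier` is a hypothetical name, and the dataset numbers are placeholders.

```python
from urllib.parse import parse_qs, urlsplit

def make_identifier(landing_page, prefix='pasda-'):
    """Build an ID such as 'pasda-1234' from a DataSummary.aspx URL."""
    query = parse_qs(urlsplit(landing_page).query)
    values = query.get('dataset')
    if not values:
        raise ValueError(f'No dataset parameter in {landing_page!r}')
    return prefix + values[0]

# Placeholder dataset numbers, for illustration only
base = 'https://www.pasda.psu.edu/uci/DataSummary.aspx'
print(make_identifier(base + '?dataset=1234'))
print(make_identifier(base + '?dataset=1234&view=full'))
```

Raising an error on a URL without a `dataset` parameter surfaces unexpected links early instead of writing a malformed ID to the CSV.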

## Part 3: Extract the bounding boxes

Context: The bounding box coordinates are not part of the discovery metadata on each item's landing page. However, most of the records have supplemental metadata in ISO 19139 or FGDC format. The link to this document is found in the 'Metadata' column (written to the `HTML` column of the output CSV in this notebook). Although these files are created as XML, the link points to a rendered HTML page, so we can parse it with a method similar to the one shown in Part 2.

1. Create a CSV file that lists just the metadata file pages. See `sample-inputMetadataUrls.csv` for an example.
2. Use the `getBbox.py` script to parse the files and extract the bounding boxes.
3. Merge these values back into the output CSV from Part 2.
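
Step 3 can be sketched as a join keyed on the metadata URL. The example below is an assumption about how the merge could be done, not the workflow's own script: it uses `csv.DictReader` and `csv.DictWriter` so that columns are addressed by name rather than by position, and it runs on in-memory sample data. The `HTML` and `Bounding Box` column names follow the field list used in this notebook; `merge_bboxes` and the sample values are hypothetical.

```python
import csv
import io

def merge_bboxes(rows_csv, bboxes, key='HTML', new_column='Bounding Box'):
    """Return CSV text with a bounding box column joined on the key column."""
    reader = csv.DictReader(io.StringIO(rows_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + [new_column],
                            lineterminator='\n')
    writer.writeheader()
    for row in reader:
        # Rows without a matching metadata URL are marked 'missing'
        row[new_column] = bboxes.get(row[key], 'missing')
        writer.writerow(row)
    return out.getvalue()

# Placeholder rows and coordinates, for illustration only
rows = ('ID,Title,HTML\n'
        'pasda-1,Sample A,https://example.org/meta1\n'
        'pasda-2,Sample B,https://example.org/meta2\n')
boxes = {'https://example.org/meta1': '-80.5,39.7,-74.7,42.3'}
merged = merge_bboxes(rows, boxes)
print(merged)
```

Because the bounding box string contains commas, the csv module quotes it automatically on output.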

In [51]:
# Dictionary mapping each metadata URL to its bounding box string
bounding_boxes = {}

# FGDC labels in the rendered metadata page, in west,south,east,north order
bbox_labels = ['West_Bounding_Coordinate:', 'South_Bounding_Coordinate:',
               'East_Bounding_Coordinate:', 'North_Bounding_Coordinate:']

# Extract bounding boxes for the metadata URLs present in `parseElements`
for element in parseElements:
    metadata_url = element[11]  # This index might change if you modify your parseElements structure
    try:
        page = urlopen(metadata_url).read()
        soup = BeautifulSoup(page, "html.parser")

        coords = []
        for label in bbox_labels:
            try:
                # The value is the text node that follows the <i> label
                coords.append(soup.find('i', string=label).next_sibling.strip())
            except Exception:
                coords.append('')  # Label not found on this page

        bbox = ','.join(coords)
    except Exception:
        bbox = "missing"  # Page could not be fetched or parsed

    bounding_boxes[metadata_url] = bbox

            

# You might want to add a column in your parseElements for the bounding box or replace an existing one.

# Generate action date with format YYYYMMDD
actionDate = time.strftime('%Y%m%d')

# Write outputs to local CSV file
with open(f'output_{actionDate}.csv', 'w', newline='', encoding='utf-8') as fw:
    fields = ['Information', 'ID', 'Title', 'Temporal Coverage', 'Date Range', 'Publisher', 'Provider', 'Language', 'Description', 'Resource Class', 'Resource Type', 'HTML', 'Download', 'Code', 'Is Part Of', 'Member Of', 'Access Rights', 'Accrual Method', 'Date Accessioned', 'Bounding Box']

    writer = csv.writer(fw)
    writer.writerow(fields)
    for element in parseElements:
        metadata_url = element[11]  # This index might change if you modify your parseElements structure
        bbox = bounding_boxes.get(metadata_url, 'missing')  # Get the bounding box or default to 'missing'
        # Write a copy with the bounding box appended, so re-running this cell does not add duplicate columns to parseElements
        writer.writerow(element + [bbox])

    print('#### Job done ####')

#### Job done ####
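
One caveat with the bounding box cell above: when a coordinate label is not found on a metadata page, the joined string can come out as `,,,` or with a gap in it. A small check like the following could be applied before writing the column. It is a sketch under the assumption that coordinates are decimal degrees in west, south, east, north order; `clean_bbox` is a hypothetical helper and the sample values are placeholders.

```python
def clean_bbox(bbox):
    """Return 'w,s,e,n' if the string holds four valid coordinates, else ''."""
    parts = [p.strip() for p in bbox.split(',')]
    if len(parts) != 4:
        return ''
    try:
        west, south, east, north = (float(p) for p in parts)
    except ValueError:
        return ''  # Empty or non-numeric coordinate
    # Longitudes must fall in [-180, 180], latitudes in [-90, 90]
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        return ''
    if not (-90 <= south <= north <= 90):
        return ''
    return ','.join(parts)

print(clean_bbox('-80.5, 39.7, -74.7, 42.3'))
print(repr(clean_bbox(',,,')))
print(repr(clean_bbox('missing')))
```

Returning an empty string for incomplete boxes keeps partially filled coordinates out of the final CSV, where they would be harder to spot later.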
