In [1]:
import pandas as pd
import pathlib
import requests
import json

In [2]:
'''
Loading Notices of Inventory via Federal Register

fedreg_notices_of_inventory.csv downloaded from Federal Register 
search for "Inventory Completion"
agency: National Park Service
type: Notice
'''

data_path = pathlib.Path.cwd() / 'fedreg_notices_of_inventory.csv'
fed_inventories = pd.read_csv(data_path)
fed_inventories['html_url']

0      https://www.federalregister.gov/documents/2020...
1      https://www.federalregister.gov/documents/2019...
2      https://www.federalregister.gov/documents/2019...
3      https://www.federalregister.gov/documents/2019...
4      https://www.federalregister.gov/documents/2019...
                             ...                        
995    https://www.federalregister.gov/documents/2011...
996    https://www.federalregister.gov/documents/2011...
997    https://www.federalregister.gov/documents/2013...
998    https://www.federalregister.gov/documents/2013...
999    https://www.federalregister.gov/documents/2013...
Name: html_url, Length: 1000, dtype: object

#### Problem: The Federal Register export contains only 1000 results, although the search found 2697.

#### Instead, we'll start building our dataset from the NPS site. Documents that exist in the Federal Register but not in the NPS table (i.e. corrections) can be loaded later by linking on the JSON data.
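
One way the later comparison could look is an anti-join on the URL columns. This is only a sketch: the helper name `only_in_federal_register` is hypothetical, the demo URLs are made up, and it assumes that the `html_url` column of the Federal Register export and the `Link` column of the NPS table hold comparable URLs, which has not been checked here.

```python
import pandas as pd


def only_in_federal_register(fed_df, nps_df, fed_col="html_url", nps_col="Link"):
    """Return the rows of fed_df whose URL does not appear in nps_df.

    Hypothetical helper: assumes both columns hold comparable URL strings.
    """
    # Light normalisation so stray whitespace or a trailing slash
    # does not produce a false mismatch.
    fed_key = fed_df[fed_col].str.strip().str.rstrip("/")
    nps_key = nps_df[nps_col].str.strip().str.rstrip("/")
    return fed_df[~fed_key.isin(nps_key)]


# Toy frames with made-up URLs, standing in for fed_inventories and nps_inventories.
fed_demo = pd.DataFrame({"html_url": [
    "https://example.org/doc/a",
    "https://example.org/doc/b/",
    "https://example.org/doc/c",
]})
nps_demo = pd.DataFrame({"Link": [
    "https://example.org/doc/a",
    "https://example.org/doc/b",
]})

missing = only_in_federal_register(fed_demo, nps_demo)
print(missing["html_url"].tolist())
```

If the two sources format their URLs differently, matching on a shared identifier taken from the JSON metadata would likely be more reliable than matching on the raw URL.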

In [4]:
'''
Loading Notices of Inventory via National Park Service (NPS) JSON

Found the URL for the JSON data in the page source of
https://www.nps.gov/subjects/nagpra/notices-of-inventory-completion.htm
'''

url = 'https://www.nps.gov/common/uploads/sortable_dataset/nagpra/F8663396-E1B9-7C54-8C15C08D2D0702C4/F8663396-E1B9-7C54-8C15C08D2D0702C4.json'
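response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
# Parse the payload once and reuse it; it has 'COLUMNS' and 'DATA' keys
inventories_dict = json.loads(response.content)
nps_inventories = pd.DataFrame(data=inventories_dict['DATA'],
                               columns=inventories_dict['COLUMNS'])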
nps_inventories

Unnamed: 0,Publication Date,Title,Link
0,12/9/2019,"Sam Noble Oklahoma Museum of Natural History, ...",https://www.federalregister.gov/documents/2019...
1,11/27/2019,"Tennessee Valley Authority, Knoxville, TN",https://www.federalregister.gov/documents/2019...
2,11/27/2019,"Tennessee Valley Authority, Knoxville, TN",https://www.federalregister.gov/documents/2019...
3,11/27/2019,"University of California, Santa Cruz, Santa Cr...",https://www.federalregister.gov/documents/2019...
4,11/27/2019,"Los Angeles Pierce College, Woodland Hills, CA",https://www.federalregister.gov/documents/2019...
...,...,...,...
2462,7/2/1994,Notice of Inventory Completion for Native Amer...,https://www.federalregister.gov/documents/1994...
2463,2/28/1994,Inventory Completion of Native American Human ...,https://www.federalregister.gov/documents/1994...
2464,2/28/1994,Inventory Completion for Native American Human...,https://www.federalregister.gov/documents/1994...
2465,2/25/1994,Notice of Completion of Inventory of Native Am...,https://www.federalregister.gov/documents/1994...


#### The dataset from the National Park Service returns 2467 results.

In [21]:
'''
Trying to open Federal Register document
via NPS Notices of Inventories dataframe

Goal is to get metadata as well as full text for each record

Testing on first record
'''

test_url = nps_inventories['Link'][0]
test_response = requests.get(test_url)
test_response.content



#### Appending '.json' to the URL does not return JSON as expected.

#### The HTML (saved as test_response.content) contains URLs for the JSON (metadata) and the XML (original full text).

#### The question now is how to isolate these URLs of interest from the HTML.
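
One possible approach is to collect every `href` in the page and keep those ending in `.json` or `.xml`. The sketch below uses only the standard library. The names `LinkCollector` and `find_data_urls` are hypothetical, the sample HTML is invented, and it assumes the links of interest appear as `href` attributes on `<a>` or `<link>` tags, which has not been confirmed against the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_data_urls(html, base_url, suffixes=(".json", ".xml")):
    """Return absolute URLs in html whose path ends with one of suffixes."""
    if isinstance(html, bytes):  # requests' .content is bytes
        html = html.decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        # Ignore any query string or fragment when checking the extension.
        path = href.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(tuple(suffixes)):
            full = urljoin(base_url, href)  # resolve relative links
            if full not in urls:  # drop duplicates, keep page order
                urls.append(full)
    return urls


# Invented HTML standing in for test_response.content.
sample_html = b"""
<html><body>
<a href="/about">About</a>
<a href="/api/example/doc-1.json">JSON</a>
<a href="https://example.org/full_text/doc-1.xml">XML</a>
<a href="/api/example/doc-1.json">JSON again</a>
</body></html>
"""
found = find_data_urls(sample_html, "https://example.org/documents/doc-1")
print(found)
```

In this notebook the call would be `find_data_urls(test_response.content, test_url)`. If the page lists other `.json` or `.xml` links as well, the result would need further filtering.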