An interface for collecting and parsing Federal Register documents
Clone the code from GitHub.
The module requires a file named
config.py in the root project directory. This file must define a variable named
dataDir pointing to the root directory where data will be saved.
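A minimal config.py might look like the following; the path shown is a placeholder, not a required value:

```python
# config.py (place in the root project directory)
# dataDir must point to the directory where downloaded data will be stored.
# The path below is an illustrative placeholder; set it for your machine.
dataDir = "/path/to/federal_register_data"
```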
The module collects data from two sources:
dataCollection/downloadMetadata.py downloads metadata describing Federal Register documents from the federalregister.gov API. Raw metadata is saved in annual zipped JSON files in
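For illustration, a paged query to the federalregister.gov documents API can be built as below. The README does not show the exact query parameters downloadMetadata.py uses; the ones here follow the public API's documented style and are assumptions:

```python
from urllib.parse import urlencode

# Base endpoint of the federalregister.gov documents API.
BASE = "https://www.federalregister.gov/api/v1/documents.json"

def metadata_url(year: int, page: int = 1, per_page: int = 1000) -> str:
    """Build a paged query for all documents published in a given year.

    The parameter names mirror the API's conditions[...] style; the exact
    parameters used by downloadMetadata.py may differ.
    """
    params = {
        "per_page": per_page,
        "page": page,
        "conditions[publication_date][gte]": f"{year}-01-01",
        "conditions[publication_date][lte]": f"{year}-12-31",
    }
    return BASE + "?" + urlencode(params)

print(metadata_url(2020))
```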
dataCollection/downloadXML.py downloads the text of daily Federal Register documents from govinfo.gov. Raw XML files are saved in
dataCollection/compileParsed.py builds parsed versions of the documents, where the XML is converted into Pandas data tables. These files are saved as pickled dataframes in
dataDir/parsed. Files are named by document number, which must be extracted from the XML itself (and occasionally contains errors). The XML files sometimes contain duplicate printings of the same document, but each document appears only once in the parsed directory.
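The deduplication step described above can be sketched as follows. The FRDOC-style line format and the regex are assumptions about where the document number appears in the XML; real files vary and, as noted, occasionally contain errors:

```python
import re

# Document numbers appear in lines like "[FR Doc. 2020-12345 Filed ...]".
# This pattern is an illustrative assumption, not the project's actual parser.
FRDOC_RE = re.compile(r"FR Doc\.?\s+([A-Za-z0-9-]+)")

def dedupe(frdoc_lines):
    """Yield (document_number, line) pairs, keeping the first printing only."""
    seen = set()
    for line in frdoc_lines:
        m = FRDOC_RE.search(line)
        if m is None:
            continue  # document number could not be extracted
        docno = m.group(1)
        if docno in seen:
            continue  # duplicate printing of the same document
        seen.add(docno)
        yield docno, line

sample = [
    "[FR Doc. 2020-12345 Filed 6-5-20; 8:45 am]",
    "[FR Doc. 2020-12345 Filed 6-5-20; 8:45 am]",  # duplicate printing
    "[FR Doc. 2020-99999 Filed 6-5-20; 8:45 am]",
]
print([d for d, _ in dedupe(sample)])
```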
The complete dataset can be downloaded from scratch or updated to the latest available data by running
The complete dataset is approximately 20GB in size.
Cleaned and processed data can be loaded through
loaders.py. The most important functions are:
loadInfoDF loads all document metadata as a single dataframe
iterParsed iteratively loads available parsed documents
loadParsed loads a single parsed document
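As a rough sketch of what a loadParsed-style lookup involves: parsed documents live under dataDir/parsed as pickled dataframes named by document number. The file naming ("&lt;docno&gt;.p") and the function below are hypothetical, for illustration only; see loaders.py for the actual interface:

```python
import os
import tempfile

import pandas as pd

def load_parsed(data_dir: str, docno: str) -> pd.DataFrame:
    """Hypothetical loader: read one parsed document from dataDir/parsed.

    The "<docno>.p" naming convention is an assumption for this sketch.
    """
    return pd.read_pickle(os.path.join(data_dir, "parsed", f"{docno}.p"))

# Round-trip demo with a temporary data directory.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "parsed"))
    df = pd.DataFrame({"tag": ["HD", "P"], "text": ["Title", "Body text"]})
    df.to_pickle(os.path.join(data_dir, "parsed", "2020-12345.p"))
    loaded = load_parsed(data_dir, "2020-12345")
    print(loaded.shape)  # (2, 2)
```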