Allows for conversion of the US EPA's eGRID data set into RDF

cbdavis/eGRID-to-RDF

eGRID to RDF

This code converts the eGRID dataset from the US EPA into RDF so that sophisticated queries can be run over the data. The US EPA publishes eGRID as a series of Excel files covering several years. That format works well for looking up facts about a single power plant in a single year, but it is not ideal for more extensive data mining.

Instructions

eGRID.R downloads the relevant files from the US EPA eGRID website and merges the data for all years into three separate CSV files, one each for plants, generators, and boilers. Make sure that you have installed the required libraries, which are listed at the top of the code.
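The merge step can be pictured with a minimal Python sketch. This is not the code in eGRID.R; it only illustrates stacking per-year tables into one CSV with an added year column. The column names other than ORISPL, the sample values, and the helper names `merge_years` and `to_csv` are assumptions made for illustration.

```python
import csv
import io


def merge_years(tables):
    """Stack per-year CSV tables into one list of rows with a 'year' column.

    `tables` maps a year to CSV text. The union of all columns is kept,
    in case the column set differs between years.
    """
    rows = []
    fieldnames = ["year"]
    for year in sorted(tables):
        for row in csv.DictReader(io.StringIO(tables[year])):
            for name in row:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.append({"year": year, **row})
    return fieldnames, rows


def to_csv(fieldnames, rows):
    """Serialize merged rows; columns absent in a given year stay empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample data: two years with slightly different columns.
sample = {
    2005: "ORISPL,PNAME,PLNGENAN\n4941,Navajo,100\n",
    2007: "ORISPL,PNAME,PLNGENAN,PLCO2AN\n4941,Navajo,110,50\n",
}
fieldnames, rows = merge_years(sample)
merged_csv = to_csv(fieldnames, rows)
print(merged_csv)
```

Keeping the year as an explicit column is what later allows each value to be attached to a specific year when the rows are mapped to RDF.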

After the R code is run, Google Refine is used to map the data in the CSV files to RDF. Make sure that you first have the RDF Refine extension installed.

You probably want to increase the amount of memory available to Google Refine, as it will be used to generate rather large files (~400 MB each). If you're running it from the command line, use:

export REFINE_MEMORY=2048M
./refine

From there, load the CSV files, creating a new project for each. The JSON files in this repository are records of the change history that we use to map from the CSV data to RDF. To use these, click on "Undo/Redo", then click on "Apply" and copy/paste the contents of the appropriate JSON file. From there, the files can be exported as RDF, TTL, etc. This may take several minutes.
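Before pasting a history file into Refine, it can help to see which steps it contains. The sketch below assumes that a history file is a JSON array of objects carrying "op" and "description" keys; the two-step history shown is hypothetical and is not taken from this repository, and `summarize_operations` is an illustrative helper name.

```python
import json


def summarize_operations(history_text):
    """Return (op, description) pairs from an operation-history JSON array."""
    operations = json.loads(history_text)
    return [(step.get("op", "?"), step.get("description", ""))
            for step in operations]


# Hypothetical two-step history, used only to exercise the function.
example_history = json.dumps([
    {"op": "core/column-rename", "description": "Rename column A to B"},
    {"op": "core/text-transform",
     "description": "Text transform on cells in column B"},
])

for op, description in summarize_operations(example_history):
    print(f"{op}: {description}")
```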

Background

This code has been developed in our work on enipedia.tudelft.nl, which is an ongoing exploration of how sophisticated visualizations and data management techniques can help us to explore and get a better understanding of various energy and industry topics. While there is a growing amount of data being made publicly available, the data is not always published in formats that make it easy to process and navigate. The code here shows some of our efforts at fixing this.

Application Examples

The data is now in a graph format instead of a tabular format, which makes it easy to run queries that follow different pathways through the data.
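As a rough illustration of the difference, the Python sketch below turns one tabular row into subject-predicate-object triples, using the plant and property namespaces that appear in the queries in this section. It is deliberately flat: the actual mapping attaches yearly values through intermediate nodes (as the queries show), which this sketch does not attempt. The row values and the function names are assumptions for illustration.

```python
PLANT = "http://enipedia.tudelft.nl/data/eGRID/Plant/"
PROP = "http://enipedia.tudelft.nl/data/eGRID/prop/"


def row_to_triples(row, id_column="ORISPL"):
    """Turn one table row into (subject, predicate, object) triples.

    The identifier column becomes the subject URI; every other
    non-empty cell becomes a plain string literal.
    """
    subject = f"<{PLANT}{row[id_column]}>"
    triples = []
    for column, value in row.items():
        if column == id_column or value == "":
            continue
        predicate = f"<{PROP}{column}>"
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append((subject, predicate, f'"{escaped}"'))
    return triples


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Illustrative row; the values are assumptions.
row = {"ORISPL": "4941", "State_abbreviation": "AZ"}
triples = row_to_triples(row)
print(to_ntriples(triples))
```

Once every cell is a triple, a query can start from any node and follow links in either direction, rather than being tied to the row and column layout of a single spreadsheet.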

The Navajo Generating Station (scroll to bottom of link) has apparently installed an SO2 scrubber within the past few years. If you were working with the data in Excel spreadsheet form, you would have to search through eight spreadsheets, each with 4000-5000 rows and 150 columns, to find this.

By converting the data to RDF, we are able to run SPARQL queries and very efficiently retrieve various views of the data. For example, using the identifier from the ORISPL column of the spreadsheets, the SPARQL query below (run at http://enipedia.tudelft.nl/sparql) shows all of the emissions for every year for that particular plant (results):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX plant: <http://enipedia.tudelft.nl/data/eGRID/Plant/>
PREFIX egridprop: <http://enipedia.tudelft.nl/data/eGRID/prop/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?emissionName ?amount ?year from <http://enipedia.tudelft.nl/data/eGRID> {
  plant:4941 egridprop:Annual_Emissions ?emissions .
  ?emissions egridprop:Year ?year . 
  ?emissions egridprop:Amount ?amount . 
  ?emissions rdfs:label ?emissionName . 
} order by ?emissionName ?year 
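A query like the one above can also be sent programmatically. The sketch below only builds the HTTP request and does not send it, since the endpoint may not be reachable. It assumes the endpoint accepts a GET request with a `query` parameter and an `Accept` header asking for JSON results, as described by the SPARQL Protocol; whether this particular endpoint honours that header is an assumption. The helper names are illustrative.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://enipedia.tudelft.nl/sparql"


def emissions_query(orispl):
    """Build the per-plant emissions query for a given ORISPL identifier."""
    plant_id = int(orispl)  # reject anything that is not a plain number
    return f"""PREFIX plant: <http://enipedia.tudelft.nl/data/eGRID/Plant/>
PREFIX egridprop: <http://enipedia.tudelft.nl/data/eGRID/prop/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?emissionName ?amount ?year from <http://enipedia.tudelft.nl/data/eGRID> {{
  plant:{plant_id} egridprop:Annual_Emissions ?emissions .
  ?emissions egridprop:Year ?year .
  ?emissions egridprop:Amount ?amount .
  ?emissions rdfs:label ?emissionName .
}} order by ?emissionName ?year"""


def build_sparql_request(query, endpoint=ENDPOINT):
    """Return an unsent GET request carrying the URL-encoded query."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(emissions_query(4941))
print(request.full_url)
# To actually run the query, pass `request` to urllib.request.urlopen().
```

Passing the identifier through `int()` keeps arbitrary text from being spliced into the query string.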

A raw view of the data available for this power plant can be seen here via the Pubby Linked Data Frontend.

These example queries show that electricity production is increasing across nearly every U.S. state, as are CO2 emissions. However, the U.S. as a whole is decarbonizing in terms of CO2 emissions per MWh of generation. Each of the tables is generated via a single query. CO2 emissions per unit of generation output per state per year are found via the following query (results):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX plant: <http://enipedia.tudelft.nl/data/eGRID/Plant/>
PREFIX egridprop: <http://enipedia.tudelft.nl/data/eGRID/prop/>
PREFIX egrid: <http://enipedia.tudelft.nl/data/eGRID/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
select ?state (sum(?emissionAmount)/sum(?genAmount) as ?intensity) ?year1 from <http://enipedia.tudelft.nl/data/eGRID> {
  ?plant rdf:type egrid:Plant . 
  ?plant egridprop:State_abbreviation ?state . 
  ?plant egridprop:Annual_Net_Generation ?generation .
  ?generation egridprop:Year ?year1 . 
  ?generation egridprop:Amount ?genAmount . 
  ?plant egridprop:Annual_Emissions ?emissions .
  ?emissions rdfs:label "CO2" . 
  ?emissions egridprop:Year ?year2 . 
  filter(xsd:double(?year1) = xsd:double(?year2)) . 
  ?emissions egridprop:Amount ?emissionAmount . 
} group by ?state ?year1 order by ?state ?year1
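The aggregation in this query divides the summed emissions by the summed generation within each state and year, which is not the same as averaging per-plant ratios. A small Python sketch of that arithmetic follows; the numbers are made up, and the function name is illustrative.

```python
from collections import defaultdict


def intensity_by_state_year(records):
    """Compute sum(emissions) / sum(generation) per (state, year).

    `records` holds (state, year, emissions, generation) tuples,
    one per plant and year.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for state, year, emissions, generation in records:
        totals[(state, year)][0] += emissions
        totals[(state, year)][1] += generation
    # Groups with zero total generation are skipped to avoid dividing by zero.
    return {key: e / g for key, (e, g) in sorted(totals.items()) if g != 0}


# Made-up values: two plants in 2005, one plant in 2007.
records = [
    ("AZ", 2005, 100.0, 200.0),
    ("AZ", 2005, 50.0, 50.0),
    ("AZ", 2007, 90.0, 300.0),
]
result = intensity_by_state_year(records)
for (state, year), intensity in result.items():
    print(state, year, round(intensity, 3))
```

For the 2005 group the ratio of sums is 150 / 250 = 0.6, whereas the mean of the two per-plant ratios (0.5 and 1.0) would be 0.75. The ratio of sums weights each plant by how much it generates, which is what a state-level intensity figure calls for.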
