Oil and Water data analysis for Undark

Mapping workshop for environmental reporting

An analysis and visualization of federal data on crude oil spills in the U.S.

The following workshop, presented at the 2018 European Science Journalism Conference, walks you through a data analysis and visualization I carried out for Undark magazine on the risk of an oil spill surrounding the Dakota Access Pipeline debate.

This workshop serves as an introduction to finding, cleaning and visualizing federal data using tools like Google Sheets, Excel, Datawrapper and Carto.com.

Slides are at bit.ly/ECSJdataviz.

Full tutorial at github.com/aleszu/oilandwater.

Table of contents

  1. Data viz for environmental reporting
  2. How I found the data
  3. Data analysis and visualization
  4. Exploratory data visualization
  5. Mapping with Datawrapper (Beginner)
  6. Mapping with Carto (Intermediate)

Data viz for environmental reporting

The Financial Times is doing it

InsideClimate News is doing it

New York City is doing it

So why couldn't I do it?

What are some of the things these protesters are so angry about?

How often do spills occur?

And when pipelines rupture, how much oil spills?

Where in the U.S. do these spills occur?

Could we map that?

How I found the data

I found data from the Pipeline and Hazardous Materials Safety Administration (PHMSA).

I downloaded a zip file.

The government was nice enough to tell me what every column corresponded to.

All I really had to do in Excel was filter, sort and count.

I dropped my cleaned-up spreadsheet into Carto.com.

And then I reported my story

I tracked down sources like Rosenfeld, Stafford, Bommer, Coleman and Horn – who did the spill risk analysis for DAPL – got them on the line and wrote up an article.

We published it and watched it blow up on Twitter!

Data analysis and visualization

For this workshop, let's start with a filtered version of the top 20 crude oil spills since 2010 by size. If you want the full zip file of PHMSA flagged incidents, go here.

Data filtering and sorting

How did I get the top 20 crude oil spills by size?

Let's first inspect the full dataset in Google Sheets. Open this spreadsheet and click "Make a copy..." in the File menu. (Notice the title, hl2010toPresent; this is the same spreadsheet I downloaded in the zip file from the PHMSA website.) Add a filter by clicking "Turn on filter" in the Data menu or clicking the filter icon.

Filter by COMMODITY_RELEASED_TYPE

We want to analyze only crude oil spills – since that's what our story is about – so we'll click the drop-down triangle next to COMMODITY_RELEASED_TYPE and click "Clear" under "Filter by values..." Next, scroll down and select "CRUDE OIL" and then click the blue "OK" button.

Sort NET_LOSS_BBLS by highest to lowest

We want to see which crude oil spills were the largest in this dataset, so click NET_LOSS_BBLS and click "Sort Z -> A." (Notice that COMMODITY_RELEASED_TYPE now has a green filter icon and only CRUDE OIL rows appear.) Now you know how I got the top 20 largest spills since 2010.
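The same filter-and-sort can be sketched in pandas. The column names below come from the PHMSA file, but the rows are synthetic stand-ins, not real records from hl2010toPresent.csv:

```python
import pandas as pd

# Synthetic sample rows standing in for hl2010toPresent.csv.
df = pd.DataFrame({
    "COMMODITY_RELEASED_TYPE": ["CRUDE OIL", "HVL", "CRUDE OIL", "CRUDE OIL"],
    "NET_LOSS_BBLS": [500.0, 1200.0, 20531.0, 0.0],
})

# Keep only crude oil incidents, then sort by net loss, largest first --
# the same "filter, sort and count" done in Google Sheets.
crude = df[df["COMMODITY_RELEASED_TYPE"] == "CRUDE OIL"]
top = crude.sort_values("NET_LOSS_BBLS", ascending=False).head(20)
print(top["NET_LOSS_BBLS"].tolist())  # [20531.0, 500.0, 0.0]
```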

Exploratory data visualization

Descriptive statistics

Descriptive statistics describe the basic features of a dataset. In the case of the NET_LOSS_BBLS column, we'd like to know the average, the minimum, the maximum and the count to get our heads around the data. Select the NET_LOSS_BBLS column by clicking its header, "AE," and look at the bottom-right corner for a box listing Sum, Average, Min, Max, Count and Count Numbers. What do you notice about these spills?

Let's do the same for the UNINTENTIONAL_RELEASE_BBLS column and compare the Sum, Average, Min, Max and Count. Notice how much was supposedly recovered. In calculating the total number of crude oil spills since 2010, I used the UNINTENTIONAL_RELEASE_BBLS column. In calculating the size of the spills, I used the NET_LOSS_BBLS column, giving the operators the benefit of the doubt on the whole. (In my reporting, I found that many of these spills – especially smaller ones – leaked into containment structures not unlike a "rain barrel in a bathtub" and could conceivably be cleaned up.)
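The comparison between the two columns boils down to a recovery rate. A minimal sketch in pandas, using synthetic barrel counts (the real values live in hl2010toPresent.csv):

```python
import pandas as pd

# Synthetic spill volumes, in barrels; illustrative only.
spills = pd.DataFrame({
    "UNINTENTIONAL_RELEASE_BBLS": [100.0, 400.0, 500.0],
    "NET_LOSS_BBLS": [30.0, 100.0, 170.0],
})

released = spills["UNINTENTIONAL_RELEASE_BBLS"]
# The same figures Sheets shows in the bottom-right box:
print(released.sum(), released.mean(), released.min(),
      released.max(), released.count())

# Recovered barrels are the release minus the net loss, so the overall
# recovery rate is:
recovered = released.sum() - spills["NET_LOSS_BBLS"].sum()
recovery_rate = recovered / released.sum()
print(recovery_rate)  # 0.7
```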

Quick charting

Good data scientists, data visualizers and data journalists will all tell you that they tend to generate between 50 and 100 visualizations of their data before settling on the ones they're going to publish. Just like descriptive statistics, this exploratory data visualization is helpful in getting one's head around the data. For an idea of a quick graphic that is helpful in giving you context about the data, let's select and sort the "year" column A -> Z and then click "Chart" from the "Insert" tab on the menu. Choose "column chart" and select "Aggregate column R" at the bottom.
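The aggregation behind that quick column chart can be sketched in pandas; the years below are synthetic stand-ins for the spreadsheet's year column:

```python
import pandas as pd

# Synthetic year values; the real ones come from the PHMSA spreadsheet.
df = pd.DataFrame({"IYEAR": [2010, 2010, 2011, 2012, 2012, 2012]})

# Count incidents per year -- the aggregation the Sheets chart performs.
per_year = df["IYEAR"].value_counts().sort_index()
print(per_year.index.tolist(), per_year.tolist())  # [2010, 2011, 2012] [2, 1, 3]

# With matplotlib installed, per_year.plot(kind="bar") draws the same
# column chart.
```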

What are some other charts you could draw up?

Other questions you could ask of the data

Try asking another question of this dataset and answering it yourself. Some ideas:

  1. A commodity besides crude oil
  2. The riskiest pipeline operators
  3. Spills in specific states

Embedding your analysis in your story

Below are two places I embedded my statistical analysis of the PHMSA data. See if you can filter and sort the data to reach my conclusions.

"Since 2010, there have been more than 1,300 crude oil spills in the United States, according to data collected by the Pipeline and Hazardous Materials Safety Administration, a regulatory arm of the U.S. Department of Transportation: That’s one crude oil spill every other day."

"Of the 8.9 million gallons spilled since 2010, the agency has reported that over 70 percent, or 6.3 million gallons, has been recovered. Filtering PHMSA data to look at spills in onshore water crossings only, like rivers, however, the recovery rate drops to just 30 percent."
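The "one crude oil spill every other day" line is a simple frequency calculation. A back-of-envelope check, assuming the window runs from January 2010 through the end of 2016 (the end date is my assumption; the story ran during the DAPL debate):

```python
from datetime import date

# ~1,300 crude oil spills (from the passage above) over the assumed window.
n_spills = 1300
days = (date(2016, 12, 31) - date(2010, 1, 1)).days
print(round(days / n_spills, 1))  # 2.0 -- about one spill every other day
```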

Mapping with Datawrapper

  1. Go to datawrapper.de and click "Create a Map." Then, click "Symbol maps."

  2. Search for "USA" and click "USA » US-States."

  3. Under "Add your data," click the "Import your dataset" button, choose import by latitude and longitude, and upload a CSV like this one of the top 20 crude spills by size.

  4. Next, choose the column that will be mapped to the size of the symbols. Select "NET_LOSS_BBLS." Click "Proceed" at the bottom of the page.

  5. Click "Set symbol tooltip" to map some data to the points that will be revealed when a user hovers over them. I've decided to put ONSHORE_COUNTY_NAME and ONSHORE_STATE_ABBREVIATION in the title and NET_LOSS_BBLS in the body of the tooltip. You can add text, commas, colons and some basic HTML, like a line break, to design the tooltip. Scroll down and click "Save."

  6. Click the "Annotate" tab and add a title like "Top 20 crude oil spills since 2010."

  7. Click "Publish" and, on the next page, click "Publish Chart." Copy the URL or embed code.

Mapping with Carto

  1. Boot up Carto.com and create a login.

  2. Upload the full hl2010toPresent CSV by connecting your dataset.

  3. After your dataset is loaded in, click "Create Map."

  4. No points should appear yet. Click "Geocode" and define your parameters by selecting the "location_latitude" column for Latitude and "location_longitude" for Longitude.

  5. Click "Apply" and you should see points appear on your map.

  6. Click "Add new analysis" under the "Analysis" tab and select Filter.

  7. Select the column you want to filter by (in our case, "commodity_released_type") and then show "crude oil."

  8. Next, under the "Style" tab, select the "By value" button next to "Point size" and choose the "net_loss_bbls" column.

You're now faced with the tough decision of how to bin your data's distribution when selecting the bubble size. Let's take a quick detour into data classification.

  9. For this workshop, I selected the "Equal Intervals" classification and a 4 to 40 ramp, highlighting only the largest spills. Change the color by selecting the "Point color" color bar. Change the point overlap to "xor" to match mine.

  10. Let's add the shapefiles for the U.S.'s crude oil pipelines, which are available in a zip file from the U.S. Energy Information Administration. Add them by clicking "Add new layer" on the main layers screen. (We can also add a shapefile of the proposed DAPL route.)

  11. To add interactivity to your map, play with the "widgets" function. Add a "histogram" pulled from one of your columns, like "net_loss_bbls," and a "category" widget from the "iyear" column.

  12. Extra points if you can track down and plot the shapefile for the Standing Rock reservation.

Read the full story on Undark here.

Link to Undark interactive map here and workshop interactive map here.

Extras

Mapping nuclear data

Try mapping nuclear powerplants. Data from CarbonBrief.

European data

Shapefiles of France's AOC regions.

A great list from UPenn on European GIS data.

A shapefile of European natural gas pipelines here.