Mapping workshop for environmental reporting
An analysis and visualization of federal data on crude oil spills in the U.S.
The following workshop, presented at the 2018 European Science Journalism Conference, walks you through a data analysis and visualization I carried out for Undark magazine on the risk of an oil spill surrounding the Dakota Access Pipeline debate.
This workshop serves as an introduction to finding, cleaning and visualizing federal data using tools like Google Sheets, Excel, Datawrapper and Carto.com.
Slides are at bit.ly/ECSJdataviz.
Full tutorial at github.com/aleszu/oilandwater.
Table of contents
- Data viz for environmental reporting
- How I found the data
- Data analysis and visualization
- Exploratory data visualization
- Mapping with Datawrapper (Beginner)
- Mapping with Carto (Intermediate)
Data viz for environmental reporting
So why couldn't I do it?
What are some of the things these protesters are so angry about?
How often do spills occur?
And when pipelines rupture, how much oil spills?
Where in the U.S. do these spills occur?
Could we map that?
How I found the data
I found data from the Pipeline and Hazardous Materials Safety Administration (PHMSA).
I downloaded a zip file.
The government was nice enough to tell me what every column corresponded to.
All I really had to do in Excel was filter, sort and count.
I dropped my cleaned-up spreadsheet into Carto.com.
And then I reported my story.
I tracked down sources like Rosenfeld, Stafford, Bommer, Coleman and Horn – who did the spill risk analysis for DAPL – got them on the line, and wrote up an article.
Data analysis and visualization
Data filtering and sorting
How did I get the top 20 crude oil spills by size?
Let's first inspect the full dataset in Google Sheets. Open this spreadsheet and click "Make a copy..." in the File menu. (Note the title, hl2010toPresent – this is the same spreadsheet I downloaded from the PHMSA website zip file.) Add a filter by clicking "Turn on filter" in the Data menu or clicking the filter icon.
Filter by COMMODITY_RELEASED_TYPE
We want to analyze only crude oil spills – since that's what our story is about – so we'll click the drop-down triangle next to COMMODITY_RELEASED_TYPE and click "Clear" under "Filter by values..." Next, scroll down and select "CRUDE OIL" and then click the blue "OK" button.
Sort NET_LOSS_BBLS by highest to lowest
We want to see which crude oil spills were the largest in this dataset, so click the drop-down triangle next to NET_LOSS_BBLS and click "Sort Z -> A." (Notice that COMMODITY_RELEASED_TYPE now shows a green filter icon and only CRUDE OIL rows appear.) Now you know how I got the top 20 largest spills since 2010.
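The filter-and-sort steps above can also be scripted. Here's a minimal pandas sketch of the same workflow; the column names are real PHMSA columns, but the rows are invented stand-ins for the downloaded CSV.

```python
import pandas as pd

# Hypothetical miniature of the PHMSA hl2010toPresent sheet; the real file
# has many more rows and columns.
df = pd.DataFrame({
    "COMMODITY_RELEASED_TYPE": ["CRUDE OIL", "HVL", "CRUDE OIL", "REFINED PETROLEUM"],
    "NET_LOSS_BBLS": [20600, 500, 1200, 3000],
})

# Keep only crude oil rows (the filter step) ...
crude = df[df["COMMODITY_RELEASED_TYPE"] == "CRUDE OIL"]

# ... then sort largest-first (the "Sort Z -> A" step) and take the top 20.
top20 = crude.sort_values("NET_LOSS_BBLS", ascending=False).head(20)
```

With the full dataset loaded from the PHMSA CSV, `top20` would hold the same 20 spills you just found in Sheets.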
Exploratory data visualization
Descriptive statistics describe the basic features of a dataset. For the NET_LOSS_BBLS column, we'd like to know the average, the minimum, the maximum, and the count to get our heads around the data. Select the NET_LOSS_BBLS column by clicking its header letter, "AE," and look at the bottom-right corner for a box listing Sum, Average, Min, Max, Count and Count Numbers. What do you notice about these spills?
Let's do the same for the UNINTENTIONAL_RELEASE_BBLS column and compare the Sum, Average, Min, Max, and Count. Notice how much was supposedly recovered. In calculating the total number of crude oil spills since 2010, I used the UNINTENTIONAL_RELEASE_BBLS column. In calculating the size of the spills, I used the NET_LOSS_BBLS column, giving the operators the benefit of the doubt on the whole. (In my reporting, I found that many of these spills – especially smaller ones – leaked out into containment structures not unlike a "rain barrel in a bathtub" and could conceivably be cleaned up.)
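The same summary box Sheets shows in the corner can be computed in pandas. A sketch, using invented barrel figures in place of the real NET_LOSS_BBLS values:

```python
import pandas as pd

# Hypothetical spill sizes in barrels; stand-ins for the NET_LOSS_BBLS column.
net_loss = pd.Series([0, 2, 5, 10, 20600])

# The same figures the Sheets status bar reports.
stats = {
    "sum": net_loss.sum(),
    "average": net_loss.mean(),
    "min": net_loss.min(),
    "max": net_loss.max(),
    "count": net_loss.count(),
}
```

Even this toy series shows the pattern in the real data: a long tail of tiny spills with a few enormous outliers dragging the average up.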
Good data scientists, data visualizers and data journalists will all tell you that they tend to generate between 50 and 100 visualizations of their data before settling on the ones they're going to publish. Just like descriptive statistics, this exploratory data visualization is helpful in getting one's head around the data. For a quick graphic that gives you context about the data, select and sort the "year" column A -> Z, then click "Chart" from the "Insert" menu. Choose "Column chart" and select "Aggregate column R" at the bottom.
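The aggregation behind that column chart is just a count of spills per year. A pandas sketch with invented years (the real sheet has a year column alongside the date fields):

```python
import pandas as pd

# Hypothetical spill years; stand-ins for the sheet's "year" column.
years = pd.Series([2010, 2010, 2011, 2012, 2012, 2012])

# Count spills per year -- the same aggregation the Sheets column chart does.
per_year = years.value_counts().sort_index()

# per_year.plot.bar() would draw the equivalent column chart with matplotlib.
```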
What are some other charts you could draw up?
Other questions you could ask of the data
Try asking yourself another question and answering it with this dataset. Some ideas:
- A commodity besides crude oil
- The riskiest pipeline operators
- Spills in specific states
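As one sketch of how you might attack the third question, here's a pandas group-by on states. ONSHORE_STATE_ABBREVIATION is a real PHMSA column used later in this tutorial; the rows here are invented.

```python
import pandas as pd

# Hypothetical rows; in practice, load the full PHMSA CSV instead.
df = pd.DataFrame({
    "ONSHORE_STATE_ABBREVIATION": ["TX", "ND", "TX", "OK", "TX"],
    "NET_LOSS_BBLS": [100, 20600, 50, 10, 500],
})

# Which states see the most spills, and how much oil was lost in each?
spills_by_state = (
    df.groupby("ONSHORE_STATE_ABBREVIATION")["NET_LOSS_BBLS"]
      .agg(["count", "sum"])
)
```

The same two columns – spill count and barrels lost – map directly onto the pivot tables or filtered counts you'd build in Sheets.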
Embedding your analysis in your story
Below are two places I embedded my statistical analysis of the PHMSA data. See if you can filter and sort the data to reach my conclusions.
"Since 2010, there have been more than 1,300 crude oil spills in the United States, according to data collected by the Pipeline and Hazardous Materials Safety Administration, a regulatory arm of the U.S. Department of Transportation: That’s one crude oil spill every other day."
"Of the 8.9 million gallons spilled since 2010, the agency has reported that over 70 percent, or 6.3 million gallons, has been recovered. Filtering PHMSA data to look at spills in onshore water crossings only, like rivers, however, the recovery rate drops to just 30 percent."
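The recovery-rate arithmetic behind that second passage is simple: gallons recovered divided by gallons spilled. The totals below are the figures quoted in the story; the computation itself is just a sketch.

```python
# Figures quoted in the Undark story (gallons).
released_gal = 8.9e6   # total crude spilled since 2010
recovered_gal = 6.3e6  # total reported recovered

# Share of spilled oil recovered -- "over 70 percent."
recovery_rate = recovered_gal / released_gal
```

Re-running the same ratio on only the onshore-water-crossing spills is what produced the 30 percent figure.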
Mapping with Datawrapper
- Go to datawrapper.de and click "Create a Map." Then, click "Symbol maps."
- Search for "USA" and click "USA » US-States"
- Under "Add your data," click the "Import your dataset" button, select "Import dataset by latitude and longitude," and upload a CSV like this one of the top 20 crude spills by size.
- Next, adjust the column that will be mapped to the "size" of the symbols. Select "NET_LOSS_BBLS." Click "Proceed" at the bottom of the page.
- Click "Set symbol tooltip" to map some data to points that will be revealed once a user hovers over them. I've decided to put ONSHORE_COUNTY NAME and ONSHORE_STATE_ABBREVIATION in the title and NET_LOSS_BBLS in the body of the tooltip. You can add text, commas, colons and some basic HTML, like a line break, to design the tooltip. Scroll down and click "Save."
- Click the "Annotate" tab and add a title like "Top 20 crude oil spills since 2010."
- Click "Publish" and on the next page click "Publish Chart." Copy the URL or embed code.
Mapping with Carto
- Upload the CSV by connecting your dataset.
- After your dataset is loaded in, click "Create Map."
- No points will appear at first. Click "Geocode" and then define your parameters by selecting the "location_latitude" column for Latitude and "location_longitude" for Longitude.
- Click "Apply" and you should see points appear on your map.
- Click "Add new analysis" under the "Analysis" tab and select Filter.
- Select the column you want to filter by, in our case it's "commodity_released_type" and then show "crude oil."
- Next, under the "Style" tab, select the "By value" button next to "Point size" and select the "net_loss_bbls" column.
You're now faced with the tough decision of binning your data's distribution when selecting the bubble size. Let's do a quick aside into data classification.
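Carto's "Equal Intervals" option, used in the next step, simply splits the value range into k bins of equal width. A sketch of that classification with invented barrel values:

```python
# Equal-interval classification: split the value range into k equal-width bins.
# A sketch of what Carto's "Equal Intervals" option does; values are invented.
values = [0, 2, 5, 10, 20600]
k = 4

lo, hi = min(values), max(values)
width = (hi - lo) / k
breaks = [lo + i * width for i in range(k + 1)]
```

Notice that with skewed data like spill sizes, almost everything lands in the first bin and only the outliers reach the top bins, which is exactly why equal intervals highlight the largest spills.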
- For this workshop, I selected the "Equal Intervals" classification and a 4 to 40 ramp, highlighting only the largest spills. Change the color by selecting the "Point color" color bar. Set the point overlap to "xor" to match my map.
- Let's add in the shapefiles for the U.S.'s crude oil pipelines. Add the pipelines by clicking "Add New Layer" in the main layers screen. Those can be found in a zip from the U.S. Energy Information Administration. (We can also add a shapefile of the proposed DAPL route.)
To add interactivity to your map, play with the "widgets" function. Add a "histogram" widget pulled from one of your columns, like "net_loss_bbls," and a "category" widget from the "iyear" column.
Extra points if you can track down and plot the shapefile for the Standing Rock reservation.
Read the full story on Undark here.
Mapping nuclear data
Try mapping nuclear power plants. Data from CarbonBrief.
Shapefiles of France's AOC regions.
A great list from UPenn on European GIS data.
A shapefile of European natural gas pipelines here.