-
Notifications
You must be signed in to change notification settings - Fork 0
/
Lab_02_data_exercises.Rmd
736 lines (568 loc) · 30.2 KB
/
Lab_02_data_exercises.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
---
title: 'EES 3310/5310 Lab #2'
author: "put your name here"
date: 'Lab: Mon. Sept. 3. Due: Mon. Sept. 10.'
fontfamily: "newtxtext,newtxmath"
fontsize: "12pt"
output:
html_document: default
pdf_document: default
subtitle: Exercises in Data Manipulation
---
```{r setup, include=FALSE}
knitr::knit_hooks$set(inline = function(x) { knitr:::format_sci(x, 'md')})
knitr::opts_chunk$set(echo = TRUE)
# This section loads necessary R libraries and sources scripts that define
# useful functions format_md.
#
data_dir = "_data"
script_dir = "scripts"
library(pacman)
p_load(zoo, xml2, tidyverse, stringr)
theme_set(theme_bw(base_size = 15))
source('_scripts/utils.R', chdir = T)
source('_scripts/modtran.R', chdir = T)
mlo_url = "http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv"
giss_url = "https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.csv"
# Create a data directory if one does not exist.
if (!dir.exists('_data')) dir.create('_data')
```
# Instructions
# Worked Example
## Downloading CO~2~ Data from Mauna Loa Observatory
In 1957, Charles David Keeling established a permanent observatory on Mauna Loa,
Hawaii to make continuous measurements of atmospheric carbon dioxide. The
observatory has been running ever since, and has the longest record of direct
measurements of atmospheric carbon dioxide levels. The location was chosen
because the winds blow over thousands of miles of open ocean before reaching
Mauna Loa, and this means the CO~2~ measurements are very pure and uncontaminated
by any local sources of pollution.
We can download the data from <`r mlo_url`>. We can download the file and save
it to the local computer using the R function `download.file`
```{r download_mlo_data, include=TRUE, message=FALSE}
mlo_url = "http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv"
download.file(mlo_url, '_data/mlo_data.csv')
```
Try opening the data file in Excel or a text editor.
The first 54 lines of the data file are comments describing the data.
These comments describe the contents of each column of data
and explain that this data file uses the special value `-99.99` to
indicate a missing value. The Mauna Loa Observatory started recording data
in March 1958, so the monthly averages for January and February are missing.
Other months are missing for some months in the record when the instruments
that monitor CO~2~ concentrations were not working properly.
The `read_csv` function from the `tidyverse` package can read the data into R
and convert it into a `tibble` data structure (like a fancy data table).
However, when R reads in `.csv` files, it expects column names to be on a
single row,
and lines 55--57 of the data file are column headings that are split across
multiple rows, so R will get confused if we tell it to use those rows as column
names.
Thus, we will tell `read_csv` to read this data file, but skip the first 57 lines.
We will also tell it not to look for column names in the data file, so we will
supply the column names, and we will tell it that whenever it sees `-99.99`,
it should interpret that as indicating a missing value, rather than a measurement.
Finally, R can guess what kinds of data each column contains, but for this file,
things work a bit more smoothly if we provide this information explicitly.
`read_csv` lets us specify the data type for each column by providing a string
with one letter for each column. The letters are `i` for integer numbers,
`d` for real numbers (i.e., numbers with a decimal point and
fractional parts), `n` for an unspecified number,
`c` for character (text) data, `l` for logical (`TRUE` or `FALSE`),
`D` for calendar dates, `t` for time of day, and `T` for combined date and time.
```{r import_mlo_data, include=TRUE, message=FALSE}
mlo_data = read_csv('_data/mlo_data.csv',
skip = 57, # skip the first 57 rows
col_names = c('year', 'month', 'date.excel', 'date',
'co2.raw', 'co2.raw.seas',
'co2.fit', 'co2.fit.seas',
'co2.filled', 'co2.filled.seas'),
col_types = 'iiiddddddd', # the first three columns
# are integers and the
# next 7 are real numbers
na = '-99.99' # interpret -99.99 as a missing value
)
```
Let's look at the first few rows of the data:
Here is how it looks in R:
```{r show_data_r, include = TRUE}
head(mlo_data)
```
And here is how we can use the `kable` function to format the data
nicely as a table in an RMarkdown document:
```{r show_data, include=TRUE, results = "asis"}
head(mlo_data) %>% knitr::kable()
```
There are six different columns for the CO~2~ measurements:
* `co2.raw` is the
raw measurement from the instrument. The measurements began in March 1958, so
there are `NA` values for January and February. In addition, there are missing
values for some months when the instrument was not working well.
* `co2.fit` is a smoothed version of the data, which we will not use in this lab.
* `co2.filled` is the same as `co2.raw`, except that where there are missing
values in the middle of the data, they have been filled in with interpolated
estimates based on measurements before and after the gap.
For each of these three data series, there is also a _seasonally adjusted_
version, which attempts to remove the effects of seasonal variation in order
to make it easier to observe the trends.
For this lab, we will focus on the `co2.filled` data series. To keep things simple,
we can use the `select` function from `tidyverse` to keep only certain columns
in the tibble and get rid of the ones we don't want.
```{r simplify_mlo_data, include = TRUE}
mlo_simple = mlo_data %>% select(year, month, date, co2 = co2.filled)
head(mlo_simple)
```
Note how we renamed the `co2.filled` column to just plain `co2` in the select
function.
Now, let's plot this data:
```{r plot_mlo, include = TRUE, warning = FALSE, fig.show = "hold"}
ggplot(mlo_simple,
aes(x = date, y = co2)) + # This line specifies the data to plot
# and the aesthetics that define which
# variables to use for the x and y
# axes.
geom_line() + # This line says to plot lines between the points
labs(x = "Year", y = "CO2 concentration (ppm)",
title = "Measured CO2 from Mauna Loa Observatory")
# ^^^ The previous line gives the names of the axes and the title
# for the plot.
# Earlier in this .Rmd file, I called set_theme(theme_bw(base_size = 15))
# to set the default plot style. If you call ggplot() without this,
# you will get a different style, but you can either call theme_set
# or you can add a theme specification (such as
# "+ theme_bw(base_size = 15)")
# to the end of the sequence of plotting commands in order to
# apply a specific style to an individual plot.
```
Notice the seasonal variation in CO~2~. Every year, there is a large cycle of
CO~2~, but underneath is a gradual and steady increase from year to year.
If we wanted to look at the trend without the seasonal variation, we could use
the `co2.filled.seas` column of the original tibble, but instead, let's look at
how we might estimate this ourselves.
The seasonal cycle is 12 months long and it repeats every year. This means that
if we average the values in our table over a whole year, this cycle should average
out. We can do this by creating a new column `annual` where every row represents
the average over a year centered at that row (technically, all the months from
5 months before through six months after that date).
To do this, we use the function `rollapply` from the `zoo` package, as shown
below. The `rollapply` function allows you to take a series
of data (such as monthly CO~2~ measurements) and at each point, apply a function
to the data within a "window" that includes a certain number of points
before and after the point in question.
Here, we apply the `mean` function (`FUN = mean`) to take the average, and we
define the "window" to be the 12 points centered on the point in question.
This means that for each month in our data series, `rollapply` takes the average
of the 12 measurements centered on that month (technically, the month, the five
months before, and the six months after). You could also specify
`align = "left"` to take the 12 months starting with the given month,
or `align = "right"` to take the 12 months ending with the given month.
We also need to tell `rollapply` what to do with the months at the beginning
of the series that don't have five months of data before them
and points at the end of the series that don't have six months after them.
`fill = NA` means to set those points to `NA`, which is a special value R uses
to indicate missing values (`NA` means "not available").
```{r plot_mlo_annual, include = TRUE, warning = FALSE}
mlo_simple %>% mutate(annual = rollapply(data = co2, width = 12, FUN = mean,
fill = NA, align = "center")) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = co2), color = "dark blue") +
geom_line(aes(y = annual), color = "black", size = 2) +
labs(x = "Year", y = "CO2 concentration (ppm)",
title = "Measured and Seasonally Adjusted CO2")
```
But wait: we might want a legend to tell the reader what each colored line
represents. We can create new aesthetics for the graph mapping to do this:
```{r plot_mlo_annual_2, include = TRUE, warning=FALSE}
mlo_simple %>% mutate(annual = rollapply(data = co2, width = 12, FUN = mean,
fill = NA, align = "center")) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = co2, color = "Raw")) +
geom_line(aes(y = annual, color = "12-month average"), size = 2) +
scale_color_manual(values = c("Raw" = "dark blue", "12-month average" = "black"),
name = "Smoothing") +
labs(x = "Year", y = "CO2 concentration (ppm)",
title = "Measured and Seasonally Adjusted CO2")
```
We can also anlyze this data to estimate the average trend in CO~2~.
We use the `lm` function in R to fit a straight line to the data,
and we use the `tidy` function from the `broom` package to
print the results of the fit nicely.
R has many powerful functions to fit data, but here we will just use a very
simple one. We specify the linear relationship to fit using R's formula
language. If we want to tell R that we think `co2` is related to `date`
by the linear relationship $co2 = a + b \times \text{date}$, then we write
the formula `co2 ~ date`. The intercept is implicit, so we don't have to spell
it out.
```{r calc_mlo_trend, include = TRUE}
trend = lm(co2 ~ date, data = mlo_simple)
library(broom)
tidy(trend)
```
This shows us that the trend is for CO~2~ to rise by
`r round(summary(trend)$coefficients['date','Estimate'],2)` ppm per year,
with an uncertainty of plus or minus
`r signif(summary(trend)$coefficients['date','Std. Error'], 1)`.
We can also plot a linear trend together with the data:
```{r plot_mlo_with_fitted_trend, include = TRUE, warning=FALSE}
mlo_simple %>%
ggplot(aes(x = date, y = co2)) +
geom_line() +
geom_smooth(method = 'lm') +
labs(x = "Year", y = "CO2 concentration (ppm)",
title = "Measured CO2 and Linear Fit")
```
# Exercises
## Exercises with CO~2~ Data from the Mauna Loa Observatory
Using the `select` function, make a new data tibble called `mlo_seas`, from
the original `mlo_data`, which only has two columns: `date` and
`co2.seas`, where `co2.seas` is a renamed version of `co2.filled.seas` from the
original tibble.
```{r make_mlo_seas, include=TRUE}
# put your R code here
```
Now plot this with `co2.seas` on the _y_ axis and `date` on the _x_ axis,
and a linear fit:
```{r plot_mlo_seas, include = TRUE}
# put your R code here
# remember to use geom_smooth to include a linear fit.
```
Now fit a linear function to find the annual trend of `co2.seas`. Save
the results of your fit in a variable called `trend.seas`.
```{r fit_mlo_sease, include=TRUE}
# put your R code here to set trend.seas
```
Compare the trend you fit to the raw `co2.filled` data to the trend you fit
to the seasonally adjusted data.
## Exercises with Global Temperature Data from NASA
We can also download a data set from NASA's Goddard Institute for Space Studies
(GISS), which contains the average global temperature from 1880 through the
present.
The URL for the data file is
<`r giss_url`>
Download this file and save it in the directory `_data/global_temp_land_sea.csv`.
```{r download_giss_temp, include=TRUE}
# Put your R code here
```
* Open the file in Excel or a text editor and look at it.
* Unlike the CO~2~ data file, this one has a single line with the
data column names, so you can specify `col_names=TRUE` in `read_csv`
instead of having to write the column names manually.
* How many lines do you have to tell `read_csv` to skip?
* `read_csv` can automatically figure out the data types for each column,
so you don't have to specify `col_types` when you call `read_csv`
* This file uses `***` to indicate missing values instead of `-99.99`, so you
will need to specify `na="***"` in `read_csv`.
For future reference,
if you have a file that uses multiple different values to indicate missing
values, you can give a vector of values to `na` in `read_csv`:
`na = c('***','-99.99', 'NA', '')` would tell `read_csv` that if it finds
any of the values "***", "-99.99", "NA", or just a blank with nothing in it,
any of those would correspond to a missing value, and should be indicated by
`NA` in R.
Now read the file into R, using the `read_csv` function, and assign
the resulting tibble to a variable `giss_temp`
```{r read_giss_temp, include=TRUE}
# Put your R code here to call read_csv and read "global_temp_land_sea.csv"
# show the first 5 lines of giss_temp
# head(giss_temp, 5)
```
Something is funny here: Each row corresponds to a year, but there are columns
for each month, and some extra columns called "J-D", "D-N", "DJF", "MAM", "JJA",
and "SON". These stand for average values for the year from January through
December, the year from the previous December through November, and the seasonal
averages for Winter (December, January, and February),
Spring (March, April, and May), Summer (June, July, and August), and Fall
(September, October, and November).
The temperatures are recorded not as the thermometer reading, but as _anomalies_.
If we want to compare how temperatures are changing in different seasons and at
different parts of the world, raw temperature measurements are hard to work with
because summer is hotter than winter and Texas is hotter than Alaska, so it
becomes difficult to compare temperatures in August to temperatures in January,
or temperatures in Texas to temperatures in Alaska
and tell whether there was warming.
To make it easier and more reliable to compare temperatures at different times
and places, we define anomalies: The temperature anomaly is the difference between
the temperature recorded at a certain location during a certain month and
a baseline reference value, which is the average temperature for that month
and location over a period that is typically 30 years.
The GISS temperature data uses a baseline reference period of 1951--1980, so
for instance, the temperature anomaly for Nashville in July 2017 would be
the monthly average temperature measured in Nashville during July 2017 minus
the average of all July temperatures measured in Nashville from 1951--1980.
The GISS temperature data file then averages the temperature anomalies over all
the temperature-measuring stations around the world and reports a global average
anomaly for every month from January 1880 through the latest measurements
available (currently, July 2017).
Let's focus on the months only. Use `select` to select just the columns for
"Year" and January through December (if you are selecting a consecutive range
of columns between "Foo" and "Bar", you can call `select(Foo:Bar)`).
Save the result in a variable called `giss_monthly`
```{r make_giss_monthly, include=TRUE}
# put your R code here
```
Next, it will be difficult to plot all of the data if the months are organized as
columns. What we want is to transform the data tibble into one with three columns:
"year", "month", and "anomaly". We can do this easily using the `gather` function
from the `tidyverse` package: `gather(df, key = month, value = anomaly, -Year)`
or `df %>% gather(key = month, value = anomaly, -Year)` will gather all of the
columns except `Year` (the minus sign in `select` or `gather` means to include
all columns except the ones indicated with a minus sign) and:
* Make a new tibble with three columns: "Year", "month", and "anomaly"
* For each row in the original tibble, make rows in the new tibble for each of
the columns "Jan" through "Dec", putting the name of the column in "month"
and the anomaly in "anomaly".
Here is an example of using `gather`, using the built-in data set `presidents`,
which lists the quarterly approval ratings for U.S. presidents from 1945--1974:
```{r gather_example_prep, include=TRUE}
df = presidents@.Data %>% matrix(ncol=4, byrow = TRUE) %>%
as_tibble() %>% set_names(paste0("Q", 1:4)) %>% mutate(year = 1944 + seq(n()))
print("First 10 rows of df are")
print(head(df, 10))
```
For each year, the table has a column for the year and
four columns (Q1 ... Q4) that hold the quarterly
approval ratings for the president in that quarter.
Now we want to gather these data into three columns:
one column for the year, one column to indicate the quarter,
and one column to indicate the approval rating.
We do this with the `gather` function from the `tidyverse` package.
```{ r gather_example}
dfg = df %>% gather(key = quarter, # create a column called "quarter" to store
# the names of the columns that are gathered
value = approval, # create a column called "approval" to
# store the values from those columns
# (i.e., the approval ratings in that quarter)
-year # the minus sign means gather all columns EXCEPT year.
) %>%
arrange(year, quarter) # sort the rows of the resulting tibble to put
# the years in ascending order, from 1945 to 1971
# and within each year, sort the quarters from Q1
# to Q4
head(dfg) # print the first few rows of the tibble.
```
Now you try to do the same thing to:
* First select just the columns of `giss_monthly` for the year and the
individual months.
* Next, gather all the months togeher, so there will be three columns:
one for the year, one for the name of the month, and one for the
temperature anomaly in that month.
* Store the result in a new variable called `giss_g`
```{r gather_giss, include=TRUE}
# put your R code here
```
Remember how the CO~2~ data had a column `date` that had a year plus a fraction
that corresponded to the month, so June 1960 was 1960.4548?
Here is a trick that lets us do the same for the `giss_g` data set.
R has a data type called `factor` that it uses for managing categorical data,
such as male versus female, Democrat versus Republican, and so on.
Categorical factors have a textual label, but are silently represented as integer
numbers. Normal factors don't have a special order, so R sorts the values alphabetically.
However, there is another kind of factor called an ordered factor, which allows
us to specify the order of the values.
We can use a built-in R variable called `month.abb`, which is a vector of
abbreviations for months.
The following command will convert the `month` column in `giss_g` into an
ordered factor that uses the integer values 1, 2, ..., 12 to stand for
"Jan", "Feb", ..., "Dec", and then uses those integer values to create a new
column, `date` that holds the fractional year, just as the `date` column in
`mlo_data` did:
```
giss_g = giss_g %>%
mutate(month = ordered(month, levels = month.abb),
date = Year + (as.integer(month) - 0.5) / 12) %>%
arrange(date)`
```
In the code above, `ordered(month, levels = month.abb)` converts the variable
`month` from a character (text) variable that contains the name of the month
to an ordered factor that associates a number with each month name, such that
'Jan' = 1 and 'Dec' = 12.
Then we create a new column called `date` to get the fractional year corresponding
to that month. We have to explicitly convert the ordered factor into a number
using the function `as.integer()`, and we subtract 0.5 because the time that
corresponds to the average temperature for the month is the middle of the month.
Below, use code similar to what I put above to add a new `date` column to
`giss_g`.
```{r add_date_to_giss_g, include=TRUE}
# put your R code here
```
Now plot the monthly temperature anomalies versus date:
```{r plot_giss, include=TRUE, warning = FALSE}
# put your R code here
# ggplot( ) + # specify the data and the mappings of variables to plot aesthetics
# geom_xxx() + # put the geometries (lines, points, etc) that you want to plot
# labs() + # label the axes
# ... # put any other characteristics here.
```
That plot probably doesn't look like much, because it's very noisy.
Use the function `rollapply` from the package `zoo` to create
new columns in `giss_g` with
12-month and 10-year (i.e., 120-month) rolling averages of the
anomalies.
Make a new plot in which you plot a thin blue line for the monthly anomaly
(use `geom_line(aes(y = anomaly), color = "blue", alpha = 0.3, size = 0.1)`;
alpha is an optional specification for transparency where 0 means invisible
(completely transparent) and 1 means opaque),
a medium dark green line for the one-year rolling average,
and a thick dark blue line for the ten-year rolling average.
```{r plot_giss_with_smoothing, include=TRUE, warning = FALSE}
# put your R code here
# giss_g %>%
# mutate( ... ) %>% # put the code here to add a columns called
# # "smooth.1" for the one-year smoothed anomaly and
# # "smooth.10" for the ten-year smoothed anomaly
# ggplot(aes( ... )) + # Put code here to map variables to aesthetics
# geom_line(aes(y = anomaly), alpha = 0.3, size = 0.1) + # plot a thin blue line
# # with the un-smoothed anomaly
# geom_line(...) + # Now add a medium "dark green" line for the one-year smoothed data
# geom_line(...) + # And a thick "dark blue" line for the ten-year smoothed data
# labs( ...) + # Label your axes
# ... # add any other plot specifications you need.
```
The graph shows that temperature didn't show a steady trend until starting around
1970, so we want to isolate the data starting in 1970 and fit a linear trend
to it.
To select only rows of a tibble that match a condition, we use the function
`filter` from the `tidyverse` package:
`data_subset = df %>% filter( conditions )`, where `df` is your original tibble
and `conditions` stands for whatever conditions you want to apply.
You can make a simple condition using equalities or inequalities:
* `data_subset = df %>% filter( month == "Jan")` to select all rows where the
month is "Jan"
* `data_subset = df %>% filter( month != "Aug")` to select all rows where the
month is not August.
* `data_subset = df %>% filter( month %in% c("Sep", "Oct", "Nov")` to select all
rows where the month is one of "Sep", "Oct", or "Nov".
* `data_subset = df %>% filter(year >= 1945)` to select all rows where the year
is greater than or equal to 1945.
* `data_subset = df %>% filter(year >= 1951 & year <= 1980 )` to select all rows
where the year is between 1951 and 1980, inclusive.
* `data_subset = df %>% filter(year >= 1951 | month == "Mar")` to select all rows
where the year is greater than or equal to 1951 or the month is "Mar".
this will give all rows from January 1951 onward, plus all rows before 1951
where the month is March.
Below, create a new variable `giss_recent` and assign it a subset of `giss_g` that
has all the data from January 1970 through the present. Fit a linear trend to
the monthly anomaly and report it.
What is the average change in temperature from one year to the next?
```{r giss_trend, include=TRUE}
# put your R code here
# create giss_recent
# fit a linear trend to the data using lm()
# print the coefficients of the trend.
```
### Did Global Warming Stop after 1998?
It is a common skeptic talking point that global warming stopped in 1998.
In years with strong El Niños, global temperatures tend to be higher
and in years with strong La Niñas, global temperatures tend to be lower.
We will discuss why later in the semester.
The year 1998 had a particularly strong El Niño, and the year set a record
for global temperature that was not exceeded for several years. Indeed, compared
to 1998, it might look as though global warming paused for many years.
We will examine whether this apparent pause has scientific validity.
To begin with, we will take the monthly GISS temperature data and convert it to
annual average temperatures, so we can deal with discrete years, rather than
separate temperatures for each month.
We do this with the `group_by` and `summarize` functions.
We also want to select only recent data, so we arbitrarily say we will look at
temperatures starting in 1979, which gives us 19 years before the 1998
El Ni˜o.
We don't have a full year of data for 2017, so we want to discard that because we
won't get a full year average from it.
If we go back to the original `giss_g` data tibble, run the following code:
```{r summarize_giss, include=TRUE}
# When you are ready to run the code below, you can un-comment it in RStudio
# by highlighting the lines you want to uncomment and pressing
# Ctrl + Shift + C on Windows or Cmd + Shift + C on Mac.
# (You can also comment out code using the same key combination: it works
# as a toggle between commented and un-commented).
#
# giss_annual = giss_g %>%
# filter(year >= 1979 & year < 2017) %>%
# group_by(year) %>%
# summarize(anomaly = mean(anomaly)) %>%
# ungroup() %>%
# mutate(date = year + 0.5)
```
This code groups the giss data by the year, so that one group will have
January--December 1979, another will have January--December 1980, and
so forth.
Then we replace the groups of 12 rows for each year (each row represents one month)
with a single row that represents the average of those 12 months.
It is important to tell R to `ungroup` the data after we're done working with
the groups.
Finally, we set `date` to `year + 0.5` because the average of a year corresponds
to the middle of the year, not the beginning.
Now, let's introduce a new column `after`, which indicates whether the data is
after the 1998 El Niño:
Now plot the data and color the points for 1998 and afterward dark red to
help us compare before and after 1998.
```{r plot_recent_giss, include=TRUE, warning = FALSE}
# ggplot(giss_annual, aes(x = date, y)) +
# geom_line(size = 1) +
# geom_point(aes(color = year >= 1998), size = 1) +
# scale_color_manual(values = c(TRUE = "dark red", FALSE = "dark blue"),
# guide = "none") + # color "before" points dark blue,
# # "after" points dark red
# # don't use a legend
# labs(x = "Year", y = "Temperature Anomaly")
```
Does it look as though the red points are not rising as fast as the blue points?
Let's just plot the data from 1998 on:
```{r plot_pause, include=TRUE}
# Put your R code here
# Filter the giss_annual data to select only the years >= 1998
# plot the data
```
Now how does it look?
Let's use the `filter` function to break the data into two different tibbles:
`giss_before` will have the data from 1979--1998 and the other, `giss_after`
will have the data from 1998 onward (note that the year 1998 appears in both
tibbles).
```{r split_giss_data, include=TRUE}
# Put your R code here
```
Now use `lm` to fit a linear trend to the temperature data in `giss_before`
(from 1979--1998) and assign it to a variable `giss_trend`.
Next, add a column `timing` to each of the split data sets and set
the value of this column to "Before" for `giss_before` and "After" for
`giss_after`.
```{r add_timing, include=TRUE}
# Put your R code here
```
Now, combine the two tibbles into one tibble:
```{r combine_giss, include=TRUE, warning = FALSE}
# giss_combined = bind_rows(giss_before, giss_after)
```
Now let's use ggplot to plot `giss_combined`:
* Aesthetic mapping:
* Use the `date` column for the _x_ variable.
* Use the `anomaly` column for the _y_ variable.
* Use the `timing` column to set the color of plot elements
* Plot both lines and points.
* Set the `size` of the lines to 1
* Set the `size` of the points to 2
* Use the `scale_color_manual` function to set the color of
"Before" to "dark blue" and "After" to "dark red"
* Use `geom_smooth(data = giss_before, method="lm",
color = "blue", fill = "blue", alpha = 0.2, fullrange = TRUE)`
to show a linear trend that is fit just to the `giss_before` data.
```{r plot_combined, include = TRUE}
# Put your R code here.
```
Try this with the parameter `fullrange` set to `TRUE` and `FALSE` in the
`geom_smooth` function. What is the difference?
What this plot shows is the full data set, and a linear trend that is fit just
to the "before" data. The trend line shows both the best fit for a trend
(that's the solid line) and the range of uncertainty in the fit (that's the
light blue shaded area around the line).
If the temperature trend changed after 1998 (e.g., if the warming paused, or if
it reversed and started cooling) then we would expect the temperature measurements
after 1998 to fall predominantly below the extrapolated trend line, and our
confidence that the trend had changed would depend on the number of points that
fall below the shaded uncertainty range.
How many of the red points fall below the trend line?
**Answer:** _put your answer here._
How many of the red points fall above the trend line?
**Answer:** _put your answer here._
What do you conclude about whether global warming paused or stopped after 1998?
**Answer:** _put your answer here._