-
Notifications
You must be signed in to change notification settings - Fork 3
/
literate.Rmd
130 lines (90 loc) · 5.99 KB
/
literate.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
title: Literate programming
weight: 2
---
> Instead of imagining that our main task is to instruct a *computer* what to do, let us concentrate rather on explaining to *human beings* what we want a computer to do.
>
> -- Donald E. Knuth
Literate programming was introduced by Donald Knuth in [a 1984 article](https://doi.org/10.1093/comjnl/27.2.97) featuring the quote above.
In practice, literate programming involves writing source code that can be processed to produce
1. a document explaining what the program does,
1. a program that can be executed.
In this course, we focus on literate programming in R using **RMarkdown** ( [website](https://rmarkdown.rstudio.com/), [book](https://bookdown.org/yihui/rmarkdown/), [code](https://github.com/rstudio/rmarkdown) ). In this specific approach to literate programming, one *knits* an RMarkdown file into an HTML or PDF file, and code *chunks* can be executed and displayed in the output.
## Advantages
- Code and documentation are consistent with each other, since they are produced from the same source.
- One can focus on readability of the output document.
- One can emphasize thought processes behind the code more clearly.
## How to use RMarkdown
If you already are familiar with a flavor of [Markdown](https://en.wikipedia.org/wiki/Markdown), RMarkdown primarily extends this with the ability to write and call R code.
[This Markdown cheat sheet](https://www.markdownguide.org/cheat-sheet/) is pretty self-explanatory. One can include R code chunks by inserting lines such as the following in an RMarkdown file.
````
```{r[, <any chunk options>]}`r ''`
# R code
```
````
For example, writing
````
```{r}`r ''`
1 + 2
```
````
gives rise to
```{r}
1 + 2
```
in the rendered document. Many of the webpages for this website are written using R Markdown, so you can view their source to see further examples. In RStudio, documents are rendered by clicking the "Knit" button.
The first few [lessons](https://rmarkdown.rstudio.com/lesson-1.html) provided by RStudio may be helpful. There is an [RMarkdown cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf) as well.
## Application to data science
In many data science workflows, one uses a variety of functions to load and clean data, analyze it, and then display outputs and visualizations. A literate programming approach to these high-level tasks makes them more transparent and reproducible. Furthermore, the document can be designed with the intended audience in mind.
## Example
This example concerns historical exchange rate data against the Euro, [obtained from the European Central Bank](https://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/html/index.en.html).
This page is written in RMarkdown: you can view the source by clicking the "View source" link at the bottom of the page.
```{r}
eurofxref <- read.csv("../../static/data/eurofxref-hist.csv", na.strings="N/A")
eurofxref$Date <- as.Date(eurofxref$Date)
n <- length(eurofxref$Date)
```
We can plot the USD and GBP rates over time.
```{r}
ylim <- range(c(eurofxref$USD, eurofxref$GBP))
plot(eurofxref$Date, eurofxref$USD, pch=20, ylim=ylim,
main="Daily exchange rate for USD (black) and GBP (red) against the Euro",
xlab="Date", ylab="Price of 1 Euro")
points(eurofxref$Date, eurofxref$GBP, pch=20, col="red")
```
We consider the daily change in exchange rates for both currencies.
```{r}
diff.USD <- diff(eurofxref$USD)
diff.GBP <- diff(eurofxref$GBP)
plot(eurofxref$Date[1:(n-1)], diff.USD, pch=20,
main="Daily difference in exchange rate for USD (black) and GBP (red)
against the Euro", xlab="Date", ylab="Difference in price of 1 Euro")
points(eurofxref$Date[1:(n-1)], diff.GBP, pch=20, col="red")
```
The differences may be independent normal random variables. We can plot kernel density estimates for rescaled differences and an appropriate normal density to check this.
```{r}
rescaled.diff.USD <- diff.USD/sd(diff.USD)
rescaled.diff.GBP <- diff.GBP/sd(diff.GBP)
vs <- seq(-10,10,0.01)
plot(density(rescaled.diff.USD), main="Kernel density estimates for USD (black)
and GBP (red), with standard normal density (blue)")
lines(density(rescaled.diff.GBP), col="red")
lines(vs, dnorm(vs), col="blue")
```
The estimated densities for the rescaled differences appear to be similar to each other, but different from a standard normal.
We can perform a Shapiro--Wilk test for each of the rescaled differences to see if they could be normal.
```{r}
# shapiro.test sample size must be between 3 and 5000
shapiro.test(rescaled.diff.USD[1:5000])
shapiro.test(rescaled.diff.GBP[1:5000])
```
There is very strong evidence against the rescaled differences being normally distributed.
We can also perform a Kolmogorov--Smirnov test to see if the rescaled differences could be from the same distribution.
```{r}
ks.test(rescaled.diff.USD, rescaled.diff.GBP)
```
The p-value is not that small, so evidence against the two rescaled differences coming from the same distribution is weak.
## When not to use RMarkdown
As suggested above, literate programming is very good for presenting a high-level data analysis. It is often used to communicate how data is processed and how conclusions are drawn.
It is not common to write packages of low-level functionality entirely using RMarkdown. Instead, people typically write R scripts with helpful comments that are easy to understand, and use function and variable names that ease understanding of the code itself. It is possible to use special "roxygen" comments that can automatically be turned into help documentation for users of those functions: we will cover this in more detail later.
An RMarkdown file has Markdown text and specially delimited code chunks, which can be processed to produce the desired output. [An alternative](https://yihui.name/knitr/demo/stitch/) is to write an R script with specially delimited Markdown chunks. These chunks are automatically ignored by R as they are parsed as comments, but can optionally be converted into a literate programming document by the `knitr::spin` function.