-
Notifications
You must be signed in to change notification settings - Fork 3
/
data.Rmd
186 lines (125 loc) · 5.59 KB
/
data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
---
title: Data
weight: 3
output:
blogdown::html_page:
toc: true
---
This page introduces a few basics ways to deal with data in R.
## Creating synthetic data
R has many functions for simulating [random variates](https://en.wikipedia.org/wiki/Random_variate) from common distributions, as well as evaluating their density / mass functions, distribution functions and quantile functions. For example, the Uniform(0,1) distribution is served by the functions
- `runif(n)`: simulates `n` random variates,
- `dunif(x)`: evaluates the density at `x`,
- `punif(x)`: evaluates the distribution function at `x`,
- `qunif(u)`: evaluates the quantile function at `u`.
Distributions supported in this way can be found by typing `?distributions` in the R console.
We can simulate 2 sets of 100 standard normal random variates as follows.
```{r, echo=FALSE}
set.seed(12345)
```
```{r}
n <- 100
x1s <- rnorm(n)
x2s <- rnorm(n)
```
Let us imagine that these are predictors in a linear regression model. We can simulate synthetic responses, and plot them alongside `x1s` and `x2s`.
```{r}
beta.true <- c(3.2, 0.5, -0.2)
sigmaSq.true <- 2
ys <- beta.true[1] + beta.true[2]*x1s + beta.true[3]*x2s + rnorm(n, mean=0, sd=sqrt(sigmaSq.true))
```
```{r}
plot(ys, pch=20, ylim=range(c(ys, x1s, x2s)))
points(x1s, pch=20, col="red")
points(x2s, pch=20, col="blue")
```
## Creating a data frame
A fundamental data structure in R is the data frame. We can create a data frame to hold our synthetic data, and then print the first 10 elements.
```{r}
my.data <- data.frame(x1=x1s, x2=x2s, y=ys)
head(my.data, 10)
```
Each row in a data frame corresponds to an observation of each the relevant variables, in this case an observation of `x1`, `x2` and `y`.
## Saving and loading CSV data
A simple way to store small datasets is using a [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) (CSV) format. In R, we can save our `my.data` dataset using the `write.csv` command.
```{r}
write.csv(my.data, file="synthetic-data.csv", row.names=FALSE)
```
The filename is relative to the current working directory, which can be printed by using the `getwd` command.
```{r}
getwd()
```
To change the working directory, one can use the `setwd` command.
To load data from a CSV file, one can use the `read.csv` command.
```{r}
my.data <- read.csv(file="synthetic-data.csv")
head(my.data)
```
Using a CSV file has the benefit that the data is in a standard format that can be easily parsed in any programming language that can read from a text file. It can also be viewed in a text editor or spreadsheet program.
## Saving and loading in R's data format
R has support for saving and loading using an R-specific binary format. In combination with compression (the default), this can substantially reduce large file sizes and speed up the saving/loading process. R data files are usually given the ".Rdata" extension.
```{r}
save(my.data, file="synthetic-data.Rdata")
```
Any variable or R function can be saved, and not just data frames. Multiple objects can be saved using a command such as
```r
save(object1, object2, object3, file="filename.Rdata")
```
The `save.image` command saves all of the variables in the workspace.
Unlike `write.csv` and `read.csv`, the name of a saved variable is associated with the saved value. Loading the .Rdata file associates the same variable name to the same data.
```{r}
load(file="synthetic-data.Rdata")
```
## Performing a simple linear regression
With our loaded `my.data`, we can use some of R's built-in statistical methods to try to estimate a linear model for `y` given `x1` and `x2`.
```{r}
model <- lm(y ~ x1 + x2, my.data)
summary(model)
```
The linear model says that data point `y[i]` is a realization of a normal random variable
$$Y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \sigma Z^{(i)},$$
where $Z^{(1)},\ldots,Z^{(n)}$ is a sequence of independent standard normal random variables.
In particular, the intercept $\beta_0$ is included by default.
We can compare the "true" values of the coefficients with the fitted values.
```{r}
beta.true
model$coefficients
```
We can similarly compare the "true" variance of the errors with the unbiased estimate.
```{r}
sigmaSq.true
summary(model)$sigma^2
```
#### Computing the estimates directly
The estimates obtained in the linear regression example are not difficult to obtain: the `lm` function is useful because it can be used easily and has been tested extensively.
First we create the design matrix
```{r}
X <- cbind(rep(1, n), my.data$x1, my.data$x2)
```
We can check that this is the same as the design matrix used by `lm`, via the `model.matrix` function.
```{r}
all(X == model.matrix(model))
```
The maximum likelihood estimate of the coefficient vector $\beta$ is
$$\hat{\beta}_{\rm ML} = (X^{\rm T}X)^{-1}X^{\rm T}y,$$
which is straightforward to compute.
```{r}
beta.hat <- solve(t(X)%*%X, t(X)%*%my.data$y)
beta.hat
```
A prediction of $y$ given $x = (1,x_1,x_2)$ is
$$f(x) := \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2,$$
and the residual sum of squares is
$${\rm RSS} = \sum_{i=1}^n (y^{(i)} - f(x^{(i)}))^2.$$
The maximum likelihood estimate of $\sigma^2$ is
$$\hat{\sigma}^2_{\rm ML} = \frac{1}{n} {\rm RSS},$$
The ML estimate $\hat{\sigma}^2_{\rm ML}$ is not what was reported above. Instead, what is reported by `summary(model)$sigma^2` is the unbiased estimate
$$\hat{\sigma}^2 = \frac{1}{n-3} {\rm RSS}.$$
```{r}
RSS <- sum((my.data$y - X%*%beta.hat)^2)
sigmaSq.MLE <- RSS / n
sigmaSq.hat <- RSS / (n-3)
sigmaSq.MLE
sigmaSq.hat
```
The estimator is unbiased because the distribution of ${\rm RSS}/\sigma^2$ is $\chi^2_{n-3}$.