forked from rdpeng/RepData_PeerAssessment1
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
The data is unzipped, read from the CSV file, and reshaped into long format using the `melt` function from the reshape2 package.
```{r, echo=TRUE}
library(reshape2)
unzip("activity.zip")
activity <- read.csv("activity.csv")
act <- melt(activity, id.vars=c("date", "interval"))
head(activity)
head(act)
```
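To make the reshaping step concrete, here is a minimal sketch on a hypothetical four-row data frame (`toy` and its values are invented for illustration and are not taken from `activity.csv`). It shows how `melt` turns every non-id column into `variable`/`value` pairs while keeping missing values.

```r
library(reshape2)

# Hypothetical toy data with the same three columns as activity.csv
toy <- data.frame(
  steps    = c(NA, 10, 20, 30),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

# Every non-id column (here only steps) becomes a variable/value pair
toy_long <- melt(toy, id.vars = c("date", "interval"))
str(toy_long)
```

The long format is what lets the later `dcast` calls aggregate by `date` or by `interval` with a single formula.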
## What is mean total number of steps taken per day?
Ignoring missing values, a histogram of the total number of steps taken per day is plotted:
```{r hist1, echo=TRUE}
actsum <- dcast(act, date ~ variable, fun.aggregate = sum, na.rm=TRUE)
head(actsum)
with(actsum, hist(steps, breaks=20))
```
The mean and median of the total number of steps taken per day are calculated:
```{r, echo=TRUE}
mean(actsum$steps)
median(actsum$steps)
```
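One detail of the aggregation is worth spelling out: `sum(..., na.rm=TRUE)` returns 0 for a day whose values are all missing, so such a day enters the mean and median as a zero. The sketch below, on hypothetical toy data, reproduces that behaviour with base R's `tapply` and contrasts it with an alternative (not the approach used in this report) that keeps such days as `NA`.

```r
# Hypothetical toy data: the first day has no recorded steps at all
toy <- data.frame(
  steps = c(NA, NA, 20, 30),
  date  = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")
)

# Same aggregation as the dcast call above, written with base R:
# the all-missing day sums to 0
daily <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
daily
mean(daily)

# Alternative: leave all-missing days as NA and drop them in the summary
daily_na <- tapply(toy$steps, toy$date, sum)
mean(daily_na, na.rm = TRUE)
```

Which variant is appropriate depends on whether a day without any recordings should count as a day with zero steps.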
## What is the average daily activity pattern?
Still ignoring missing values, the average number of steps taken in a five-minute interval (averaged over all days) is plotted as a time series:
```{r avsteps1, echo=TRUE}
actint <- dcast(act, interval ~ variable, fun.aggregate = mean, na.rm=TRUE)
head(actint)
with(data=actint, plot(x=interval, y=steps, type="l", main="Average steps per interval"))
```
The 5-minute interval with the maximum number of steps, averaged across all days, is found here:
```{r, echo=TRUE}
actint[which.max(actint$steps),]
```
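The interval identifier can be easier to read as a clock time. The helper below is a hypothetical sketch: it assumes that the identifiers are clock times written as HHMM without leading zeros (so 835 would mean 08:35), which is not stated in this report and should be checked against the data description.

```r
# Assumption: interval identifiers are clock times written as HHMM
# without leading zeros, so 5 means 00:05 and 835 means 08:35.
interval_to_clock <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}

interval_to_clock(c(0, 5, 835, 2355))
```

Under that assumption, it could be applied to the result above as `interval_to_clock(actint$interval[which.max(actint$steps)])`.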
## Imputing missing values
The total number of missing values in `steps` is calculated:
```{r, echo=TRUE}
sum(is.na(act$value))
```
To impute the missing number of steps for a given time interval, the average value of steps for that time interval (averaged over all days) is used:
```{r, echo=TRUE}
filled <- act
for (i in seq_len(nrow(filled))) {
  # Replace each missing value with the mean for the same interval
  if (is.na(filled$value[i])) {
    int <- filled$interval[i]
    filled$value[i] <- actint$steps[actint$interval == int]
  }
}
actsum_filled <- dcast(filled, date ~ variable, fun.aggregate = sum, na.rm=TRUE)
head(actsum_filled)
```
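The same imputation can also be written without an explicit loop. The following is a minimal sketch on hypothetical toy data, not the code used for the results in this report: `match` looks up each row's interval in a table of interval means, and only the missing entries are overwritten.

```r
# Hypothetical toy data in the same long format as act
toy <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  value    = c(NA, NA, 10, 20, 30, 40)
)

# Mean per interval; the formula interface drops rows with NA by default
# (this table plays the role of actint)
interval_means <- aggregate(value ~ interval, data = toy, FUN = mean)

# Look up each row's interval mean, then fill only the missing entries
missing <- is.na(toy$value)
lookup  <- match(toy$interval, interval_means$interval)
toy$value[missing] <- interval_means$value[lookup[missing]]
toy
```

On a data set of this size the loop is fast enough, so the vectorised form is mainly a matter of style.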
The histogram of the total number of steps per day (with missing values imputed) is plotted, and the mean and median are calculated:
```{r hist2, echo=TRUE}
with(actsum_filled, hist(steps, breaks=20, main="Histogram of steps (with imputed missing values)"))
mean(actsum_filled$steps)
median(actsum_filled$steps)
```
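Imputing values decreases the frequency of 0 values and increases the frequency of values at the peak around 10,000, as seen from a comparison of the two histograms. Any day whose values are all missing sums to 0 under `na.rm=TRUE` and so lands in the 0 bin of the first histogram; after imputation, such a day takes the sum of the interval means instead. This increases the mean and median as well.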
## Are there differences in activity patterns between weekdays and weekends?
A new variable `day` is created that indicates whether a given date is a weekday or a weekend day (the comparison with "Saturday" and "Sunday" assumes an English locale), and the data is reshaped:
```{r, echo=TRUE}
filled$day <- ifelse(weekdays(as.Date(filled$date)) %in% c("Saturday", "Sunday"),
                     "weekend", "weekday")
weekdaysteps <- dcast(filled, interval ~ day, fun.aggregate = mean)
head(weekdaysteps)
weekdaymelt <- melt(weekdaysteps, id.vars = "interval")
head(weekdaymelt)
```
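Because `weekdays` returns day names in the current locale, the classification above depends on the session language. A locale-independent variant can be sketched as follows; the dates are hypothetical examples, and the sketch relies on `as.POSIXlt` storing the day of the week as a number from 0 (Sunday) to 6 (Saturday).

```r
# Hypothetical dates: 2012-10-05 was a Friday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the day of the week as 0 (Sunday) to 6 (Saturday)
wday <- as.POSIXlt(dates)$wday
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
day_type
```

Wrapping the result in `factor` also yields an actual factor variable, whereas `ifelse` on its own returns a character vector.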
Finally, the average number of steps taken during each 5-minute interval, averaged over all weekend days (top) and over all weekdays (bottom), is plotted as a pair of time series:
```{r comparison, echo=TRUE}
library(lattice)
xyplot(value~interval | variable, weekdaymelt, type="l", layout=c(1,2), xlab="Interval", ylab="Number of Steps", main="Average steps per interval on weekdays or weekend days")
```
The plots demonstrate that there are different patterns on weekdays and weekends, on average.