The data is unzipped, loaded, and reshaped using the melt function in the reshape2 package.
library(reshape2)
unzip("activity.zip")
activity <- read.csv("activity.csv")
act <- melt(activity, id.vars=c("date", "interval"))
head(activity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
head(act)
## date interval variable value
## 1 2012-10-01 0 steps NA
## 2 2012-10-01 5 steps NA
## 3 2012-10-01 10 steps NA
## 4 2012-10-01 15 steps NA
## 5 2012-10-01 20 steps NA
## 6 2012-10-01 25 steps NA
Ignoring missing values, a histogram of total number of steps taken per day is plotted:
actsum <- dcast(act, date ~ variable, fun.aggregate = sum, na.rm=TRUE)
head(actsum)
## date steps
## 1 2012-10-01 0
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-10-04 12116
## 5 2012-10-05 13294
## 6 2012-10-06 15420
with(actsum, hist(steps, breaks=20))
The mean and median values of the total number of steps taken per day is calculated:
mean(actsum$steps)
## [1] 9354.23
median(actsum$steps)
## [1] 10395
Still ignoring missing values, the average number of steps taken in a five-minute interval (averaged over all days) is plotted as a time series:
actint <- dcast(act, interval ~ variable, fun.aggregate = mean, na.rm=TRUE)
head(actint)
## interval steps
## 1 0 1.7169811
## 2 5 0.3396226
## 3 10 0.1320755
## 4 15 0.1509434
## 5 20 0.0754717
## 6 25 2.0943396
with(data=actint, plot(x=interval, y=steps, type="l", main="Average steps per interval"))
The 5-minute interval with the maximum number of steps, averaged across all days, is found here:
actint[which.max(actint$steps),]
## interval steps
## 104 835 206.1698
The total number of missing steps values is calculated:
sum(is.na(act$value))
## [1] 2304
To impute the missing number of steps for a given time interval, the average value of steps for that time interval (averaged over all days) is used:
filled <- act
for (i in 1:length(filled$value)) {
if (is.na(filled[i,]$value)) {
int = filled[i,]$interval
filled[i,]$value <- actint[actint$interval==int,]$steps
}
}
actsum_filled <- dcast(filled, date ~ variable, fun.aggregate = sum, na.rm=TRUE)
head(actsum_filled)
## date steps
## 1 2012-10-01 10766.19
## 2 2012-10-02 126.00
## 3 2012-10-03 11352.00
## 4 2012-10-04 12116.00
## 5 2012-10-05 13294.00
## 6 2012-10-06 15420.00
The histogram for total number steps per day (with missing values imputed) is plotted and the mean and median values are calculated:
with(actsum_filled, hist(steps, breaks=20, main="Histogram of steps (with imputed missing values)"))
mean(actsum_filled$steps)
## [1] 10766.19
median(actsum_filled$steps)
## [1] 10766.19
Imputing values decreases the frequency of 0 values and increases the frequency of values at the peak around 10,000, as seen from a comparison of histograms. This increases the values for mean and median as well.
A new factor variable is created that indicates whether a given date is a weekday or weekend day, and the data is reshaped:
filled$day <- ifelse(weekdays(as.Date(filled$date)) %in% c("Saturday", "Sunday"),
"weekend", "weekday")
weekdaysteps <- dcast(filled, interval ~ day, fun.aggregate = mean)
head(weekdaysteps)
## interval weekday weekend
## 1 0 2.25115304 0.214622642
## 2 5 0.44528302 0.042452830
## 3 10 0.17316562 0.016509434
## 4 15 0.19790356 0.018867925
## 5 20 0.09895178 0.009433962
## 6 25 1.59035639 3.511792453
weekdaymelt <- melt(weekdaysteps, id.vars = "interval")
head(weekdaymelt)
## interval variable value
## 1 0 weekday 2.25115304
## 2 5 weekday 0.44528302
## 3 10 weekday 0.17316562
## 4 15 weekday 0.19790356
## 5 20 weekday 0.09895178
## 6 25 weekday 1.59035639
Finally, the average number of steps taken during a given 5-minute interval, averaged over all weekend days (top) or over all weekdays (bottom) are plotted as time series plots:
library(lattice)
xyplot(value~interval | variable, weekdaymelt, type="l", layout=c(1,2), xlab="Interval", ylab="Number of Steps", main="Average steps per interval on weekdays or weekend days")
The plots demonstrate that there are different patterns on weekdays and weekends, on average.