<h1 style="color:slateblue;font-family:courier;">Different ways to calculate the column means of a dataset with R built-in functions, and the time each one requires</h1>

In [8]:
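# Column means via an explicit loop, calling mean() on each column
mean.by.column = function(X){
    v.mean = c()
    # seq_len() avoids the 1:0 trap when X has no columns
    for (i in seq_len(ncol(X))){
        v.mean[i] = mean(X[, i])
    }
    names(v.mean) = names(X)
    v.mean
}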

In [9]:
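# Column means via an explicit loop, dividing each column sum by the row count
mean.by.sum.nrow = function(X){
    v.mean = c()
    # seq_len() avoids the 1:0 trap when X has no columns
    for (i in seq_len(ncol(X))){
        v.mean[i] = sum(X[, i])/nrow(X)
    }
    names(v.mean) = names(X)
    v.mean
}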
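
Before timing the approaches, it is worth confirming that they return the same values. The cell below is a minimal sketch of such a check on the numeric columns of `iris`. The loop is written inline so the cell runs on its own, and the names `loop.mean`, `reference` and `check` are illustrative, not part of the original notebook.

```r
# Sanity check: every approach should agree on the numeric columns of iris
X = iris[, -5]

# Inline loop, equivalent in spirit to mean.by.column
loop.mean = numeric(ncol(X))
for (i in seq_len(ncol(X))){
    loop.mean[i] = mean(X[, i])
}
names(loop.mean) = names(X)

# colMeans() is used as the reference result
reference = colMeans(X)

# all.equal() compares with a numeric tolerance, which suits floating-point results
check = c(
    "colSums/nrow" = isTRUE(all.equal(colSums(X)/nrow(X), reference)),
    "apply"        = isTRUE(all.equal(apply(X, 2, mean), reference)),
    "loop"         = isTRUE(all.equal(loop.mean, reference))
)
check
```

If any entry of `check` were `FALSE`, the timing comparison would be comparing different computations rather than different implementations of the same one.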

<h2 style="color:slateblue;font-family:courier;">Time required to compute each function</h2>

<h3 style="color:slateblue;font-family:courier;">Using the rbenchmark package</h3>

In [10]:
#install.packages("rbenchmark")
library(rbenchmark)

In [13]:
# Runs each approach 1000 times on X and returns the results plus a timing table
time_required = function(X){
    out = list()
    time = benchmark("colMeans" = {
            out$colmeans = colMeans(X)
          },
          "colSums/nrow" = {
            out$colsums = colSums(X)/nrow(X)
          },
          "apply" = {
            out$apply = apply(X, 2, mean)
          },
          "mean.by.column" = {
            out$mean1 = mean.by.column(X)
          },
          "mean.by.sum.nrow" = {
            out$mean2 = mean.by.sum.nrow(X)
          },
          replications = 1000,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))
    # Sort the timing table by elapsed time, fastest first
    out$time = time[order(time$elapsed), ]
    out
}

In [14]:
result = time_required(iris[, -5])
result$time

              test replications elapsed relative user.self sys.self
4   mean.by.column         1000    0.11    1.000      0.11     0.00
5 mean.by.sum.nrow         1000    0.14    1.273      0.14     0.00
3            apply         1000    0.22    2.000      0.21     0.01
2     colSums/nrow         1000    0.25    2.273      0.25     0.00
1         colMeans         1000    0.29    2.636      0.29     0.00


In [17]:
wine.uci = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header = FALSE)

In [18]:
time_required(wine.uci[, -1])$time

              test replications elapsed relative user.self sys.self
2     colSums/nrow         1000    0.28    1.000      0.28     0.00
5 mean.by.sum.nrow         1000    0.29    1.036      0.30     0.00
4   mean.by.column         1000    0.38    1.357      0.38     0.00
1         colMeans         1000    0.41    1.464      0.41     0.00
3            apply         1000    0.53    1.893      0.52     0.01


In [19]:
breast.cancer.uci = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
              header = FALSE)

In [20]:
time_required(breast.cancer.uci[,-2])$time

              test replications elapsed relative user.self sys.self
2     colSums/nrow         1000    0.57    1.000      0.56     0.00
5 mean.by.sum.nrow         1000    0.73    1.281      0.72     0.00
1         colMeans         1000    0.75    1.316      0.69     0.05
4   mean.by.column         1000    1.01    1.772      0.99     0.00
3            apply         1000    1.25    2.193      1.20     0.01


<h2 style="color:slateblue;font-family:courier;">Considerations:</h2>

<ul style="font-size:20px;font-family:courier;">
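  <li>In all three datasets, the function <b>mean.by.column</b>, which calls <b>mean</b> on each column inside a <i>for</i> loop, was faster than the <b>apply</b> function.</li>
  <li>In all three datasets, <b>mean.by.sum.nrow</b> and <b>colSums/nrow</b> were faster than the <b>colMeans</b> function.</li>
  <li>The fastest approach depended on the dataset: <b>mean.by.column</b> was fastest on iris, while <b>colSums/nrow</b> was fastest on the wine and breast cancer datasets.</li>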
</ul> 
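
One likely contributor to these rankings is that every dataset above is a data frame, and `colMeans` and `colSums` coerce a data frame to a matrix before computing anything, whereas the loop functions read each column directly. The cell below is a rough sketch for probing this with base R's `system.time`. The helper `elapsed` is hypothetical and not part of the original notebook, and the timings will vary by machine, so no expected numbers are given.

```r
X.df  = iris[, -5]         # data frame, as in the benchmarks above
X.mat = as.matrix(X.df)    # same values stored as a numeric matrix

# Both inputs give the same column means
same = isTRUE(all.equal(colMeans(X.df), colMeans(X.mat)))
same

# Hypothetical helper: elapsed seconds for n evaluations of an expression
elapsed = function(expr, n = 1000){
    e = substitute(expr)     # capture the expression without evaluating it
    env = parent.frame()     # evaluate it where the helper was called
    system.time(for (k in seq_len(n)) eval(e, env))[["elapsed"]]
}

time.df  = elapsed(colMeans(X.df))
time.mat = elapsed(colMeans(X.mat))

timing = c("colMeans on data frame" = time.df,
           "colMeans on matrix"     = time.mat)
timing
```

If the matrix timing comes out clearly lower, the gap seen in the tables above reflects the cost of the data-frame conversion more than the cost of the averaging itself.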