# Task 2: R *apply* family
##### Daniel Alonso & Ander Iturburu

In [None]:
# loading some libraries
library(dplyr)
library(stats)

The *apply()* family belongs to base R and consists of functions that manipulate slices of data from matrices, arrays, lists and dataframes in a repetitive way. These functions let us traverse the data in several ways while avoiding explicit loop constructs.


The family is made up of the *apply()*, *lapply()*, *sapply()*, *vapply()*, *mapply()*, *rapply()*, and *tapply()* functions.

In [None]:
data <- read.csv("/content/forestfires.csv")

In [None]:
head(data)

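| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0 |
| 2 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0 |
| 3 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0 |
| 4 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0 |
| 5 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0 |
| 6 | 8 | 6 | aug | sun | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 | 0 |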


In [None]:
X <- data.matrix(data)

## *apply()* function

*apply()* is the main function of this family. Its signature is *apply(X, MARGIN, FUN, ...)*, where *X* is the array or matrix to which the function is applied, *MARGIN* indicates whether the function is applied over rows (1) or over columns (2), and *FUN* is the function to apply, which can be a built-in R function or one defined by the user. Any further arguments given in *...* are passed on to *FUN*.

In [None]:
apply(X,2,sum)

If we define our own function, we apply it in the same way:

In [None]:
fun <- function(x){
  return (sum(x)/7)
}

In [None]:
apply(X,2,fun)
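
To make the role of *MARGIN* concrete, here is a minimal sketch on a small toy matrix. The matrix `m` and its values are made up for illustration and are not taken from the forest fires data. It contrasts row-wise and column-wise application, and shows an extra argument being forwarded to *FUN*.

```r
# toy 2 x 3 matrix, filled column by column (made-up values)
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one value per row
col_sums <- apply(m, 2, sum)   # MARGIN = 2: one value per column

# arguments given after FUN are forwarded to it (here na.rm = TRUE)
m_na <- m
m_na[1, 1] <- NA
col_means <- apply(m_na, 2, mean, na.rm = TRUE)

row_sums
col_sums
col_means
```

The result always has one entry per row or per column, depending on the margin chosen.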

## *lapply()* function

The *lapply()* function is similar to *apply()*, but it works on other objects such as dataframes, lists or vectors, and its output is always a list.

In [None]:
l1 <- list(data$FFMC,data$DMC,data$DC)

In [None]:
lapply(l1,sum)
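
As a small illustration of the point that *lapply()* accepts lists and dataframes alike, the sketch below uses a toy named list and a toy dataframe (`scores` and `df` are hypothetical names with made-up values). It also uses an anonymous function in place of a named one.

```r
# toy named list (made-up values)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() keeps the names and always returns a list
means <- lapply(scores, mean)

# a dataframe is a list of columns, so lapply() walks over the columns
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
ranges <- lapply(df, function(col) max(col) - min(col))

str(means)
str(ranges)
```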

In addition, if each element of a list is itself a structured object, such as a matrix or another list, we can use this function to select just a part of each element; the selected parts are returned as a new list.

In [None]:
# two matrices with two columns each
l2 <- list(cbind(data$FFMC, data$DMC), cbind(data$DC, data$RH))

In [None]:
lapply(l2,'[',,2)

In the example above we used *lapply()* with the selection operator `[` to extract the 2nd column from each matrix in **l2**.
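
The same idea can be sketched on two toy matrices, which makes the result easy to check by eye. The list `mats` and its values are made up for illustration. The call with `"["` is compared against an equivalent anonymous function.

```r
# two toy 2 x 2 matrices (made-up values), filled column by column
mats <- list(matrix(1:4, nrow = 2), matrix(5:8, nrow = 2))

# "[" is an ordinary function; the empty argument leaves the row index open
second_cols <- lapply(mats, "[", , 2)

# the same selection written out explicitly
explicit <- lapply(mats, function(m) m[, 2])

second_cols
identical(second_cols, explicit)
```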

## *sapply()* function

The *sapply()* function works like *lapply()*, but it tries to simplify the output to the most elementary data structure possible, such as a vector or a matrix. If you pass *simplify = FALSE* to *sapply()*, it returns a list, just as *lapply()* would.

In [None]:
s1 <- sapply(l2,'[',1,2)
s1

In [None]:
is.vector(s1)

In [None]:
is.list(s1)

In [None]:
s2 <- sapply(l2, '[', 1, 2, simplify = FALSE)
s2

In [None]:
is.list(s2)

In this example we can see that when we do not simplify the result of *sapply()*, we get the same list that *lapply()* would return. If we do simplify (the default), we get a vector, which is the most elementary data structure. In fact, we can also obtain a vector with *lapply()* if we *unlist()* the list it returns.
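
The relationship between these functions can be checked directly. The sketch below uses a toy list `l` with made-up values and also includes *vapply()*, the family member that behaves like *sapply()* but requires the type of each result to be declared.

```r
# toy list of numeric vectors (made-up values)
l <- list(c(1, 2, 3), c(4, 5), 6)

as_list   <- lapply(l, sum)              # list of length 3
as_vector <- sapply(l, sum)              # simplified to a numeric vector
unlisted  <- unlist(lapply(l, sum))      # the same vector, built by hand
checked   <- vapply(l, sum, numeric(1))  # like sapply(), with a declared result type

identical(as_vector, unlisted)
identical(as_vector, checked)
```

Declaring the result type with *vapply()* makes the call fail early if *FUN* ever returns something unexpected, which can be useful in non-interactive code.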

## *rep()* function





*rep()* is not a member of the *apply()* family, but it is regularly used alongside it. Applied to a vector or factor *x*, it replicates the values of *x* a specified number of times.

In [None]:
s1 <- sapply(l2,'[',1,2)

s1

s1 <- rep(s1,c(3,2))

s1

So, in this example we have replicated the first element three times and the second element two times.
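
For comparison, here is a short sketch of the three common ways of calling *rep()* on a toy character vector (the vector `v` is made up for illustration): repeating the whole vector, repeating each element in place, and giving one count per element as in the example above.

```r
v <- c("a", "b")

whole   <- rep(v, times = 3)        # whole vector repeated: a b a b a b
in_turn <- rep(v, each = 2)         # each element repeated in place: a a b b
counts  <- rep(v, times = c(3, 2))  # one count per element: a a a b b

whole
in_turn
counts
```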

## *mapply()* function

The name *mapply()* stands for ‘multivariate’ apply. Its purpose is to vectorize a function over several arguments at once: it calls the function with the first elements of each argument, then with the second elements, and so on.

In short, *mapply()* applies a function to multiple list or vector arguments in parallel.

In [None]:
# we could do this
m1 <- matrix(c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),5,5)

# but instead, we do this
m1 <- mapply(rep,1:5,5)
m1

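| [,1] | [,2] | [,3] | [,4] | [,5] |
|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 |
| 1 | 2 | 3 | 4 | 5 |
| 1 | 2 | 3 | 4 | 5 |
| 1 | 2 | 3 | 4 | 5 |
| 1 | 2 | 3 | 4 | 5 |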


As you can probably tell, this is a much shorter way of achieving the same result, and it needs no explicit loops.
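
The example above varies only one argument. A minimal sketch of the genuinely multivariate case follows, using made-up inputs: two vectors that vary together, a fixed argument passed through `MoreArgs`, and a user-defined function of two arguments.

```r
# element-wise over two vectors: rep(1, 3), rep(2, 2), rep(3, 1)
# the results have different lengths, so a list is returned
uneven <- mapply(rep, 1:3, 3:1)

# arguments that should stay fixed for every call go in MoreArgs
pasted <- mapply(paste, c("a", "b"), c("x", "y"), MoreArgs = list(sep = "-"))

# a user-defined function of two arguments
weighted <- mapply(function(value, weight) value * weight,
                   c(10, 20, 30), c(0.5, 0.25, 0.1))

uneven
pasted
weighted
```

When every call returns a value of the same length, *mapply()* simplifies the result to a vector or matrix, as *sapply()* does; otherwise it falls back to a list.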

## Functions Related To *apply()*

### *sweep()* function

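The *sweep()* function is probably the closest relative of the *apply()* family. Its signature is *sweep(x, MARGIN, STATS, FUN)*: it combines each row or column of *x* (chosen by *MARGIN*) with the corresponding element of *STATS* using *FUN*, which is subtraction by default. Here we limit ourselves to the matrix case.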

An example use for sweep would be manipulating a large or relatively large matrix column by column or row by row.

In [None]:
# for example, take the columns FFMC, DMC and DC
ffmc_dmc_dc <- data %>% select(FFMC:DC)

# add 10 to FFMC, 20 to DMC and 30 to DC
ffmc_dmc_dc <- sweep(ffmc_dmc_dc, MARGIN = 2, STATS = c(10, 20, 30), FUN = "+")
head(ffmc_dmc_dc)

| | FFMC | DMC | DC |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 96.2 | 46.2 | 124.3 |
| 2 | 100.6 | 55.4 | 699.1 |
| 3 | 100.6 | 63.7 | 716.9 |
| 4 | 101.7 | 53.3 | 107.5 |
| 5 | 99.3 | 71.3 | 132.2 |
| 6 | 102.3 | 105.3 | 518.0 |


Alternatively, we can use *sweep()* to standardize data, for example: through the *STATS* parameter we subtract each column's mean from every entry in that column and then divide every entry by that column's standard deviation.

In [None]:
# select the columns to standardize
ffmc_dmc_dc <- data %>% select(FFMC:DC)

# column means and standard deviations
means <- apply(ffmc_dmc_dc, 2, mean)
stds <- apply(ffmc_dmc_dc, 2, sd)

# sweep through the columns: subtract the mean, then divide by the std
norm_data <- sweep(sweep(ffmc_dmc_dc, 2, means, '-'), 2, stds, '/')
head(norm_data)

| | FFMC | DMC | DC |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | -0.805179637 | -1.3220451 | -1.8287056 |
| 2 | -0.008094195 | -1.1783995 | 0.4884179 |
| 3 | -0.008094195 | -1.0488061 | 0.5601729 |
| 4 | 0.191177166 | -1.2111882 | -1.8964295 |
| 5 | -0.243596712 | -0.9301423 | -1.7968593 |
| 6 | 0.299870636 | -0.3992778 | -0.2416292 |


Note that the computations were performed column-wise because the parameter MARGIN=2 was used.
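
A quick way to convince ourselves that the two nested *sweep()* calls standardize each column is to check the result on a toy matrix. The matrix `m` below is made up for illustration; after sweeping, every column should have mean 0 and standard deviation 1.

```r
# toy 3 x 2 matrix (made-up values)
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

centered <- sweep(m, 2, apply(m, 2, mean), "-")        # subtract each column's mean
scaled   <- sweep(centered, 2, apply(m, 2, sd), "/")   # divide by each column's sd

new_means <- apply(scaled, 2, mean)  # approximately 0 for every column
new_sds   <- apply(scaled, 2, sd)    # approximately 1 for every column

scaled
new_means
new_sds
```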

### *aggregate()* function

This function is contained in the stats package, and you use it like this: *aggregate(x, by, FUN, ..., simplify = TRUE)*.

It works similarly to the *apply()* function: you specify the object and the function, and you say whether you want to simplify the result, just as with *sapply()*. The critical difference is the *by* argument, a list of grouping variables (for example, a dataframe field) by which the aggregation is performed.

In [None]:
# cols to aggregate
aggregate(data %>% select(FFMC,DMC,DC), by=list(month = data$month), FUN=mean)

| month | FFMC | DMC | DC |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| apr | 85.78889 | 15.91111 | 48.55556 |
| aug | 92.33696 | 153.73261 | 641.07772 |
| dec | 84.96667 | 26.12222 | 351.24444 |
| feb | 82.905 | 9.475 | 54.67 |
| jan | 50.4 | 2.4 | 90.35 |
| jul | 91.32812 | 110.3875 | 450.60312 |
| jun | 89.42941 | 93.38235 | 297.70588 |
| mar | 89.44444 | 34.54259 | 75.94259 |
| may | 87.35 | 26.7 | 93.75 |
| nov | 79.5 | 3.0 | 106.7 |


In the previous example we aggregated by month, using the mean as the aggregation function. This works similarly to *group_by()* followed by *summarize()*, but it takes a single call.
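
To see the grouping on data small enough to verify by hand, the sketch below runs *aggregate()* on a toy dataframe (`df`, with made-up values) and compares it with *tapply()*, the *apply()* family member that performs the same grouped computation on a single vector.

```r
# toy dataframe (made-up values)
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# aggregate() returns a dataframe with one row per group
agg <- aggregate(df["value"], by = list(group = df$group), FUN = mean)

# tapply() returns a named one-dimensional array for the same grouping
tap <- tapply(df$value, df$group, mean)

agg
tap
```

Which one to prefer is largely a matter of the output shape needed: a dataframe for further table operations, or a named array for quick lookups.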

We can perform a similar aggregation restricted to, say, the Fridays of each month.


In [None]:
# the same aggregation, restricted to Fridays
fridays <- data %>% filter(day == "fri")
aggregate(fridays %>% select(FFMC,DMC,DC),
            by=list(month = fridays$month), FUN=mean)

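| month | FFMC | DMC | DC |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| apr | 83.0 | 23.3 | 85.3 |
| aug | 91.06667 | 161.41905 | 665.2 |
| dec | 84.7 | 26.7 | 352.6 |
| feb | 85.42 | 8.36 | 39.64 |
| jul | 90.2 | 110.46667 | 381.5 |
| jun | 91.56667 | 81.53333 | 299.16667 |
| mar | 90.00909 | 34.46364 | 81.19091 |
| may | 89.6 | 25.4 | 73.7 |
| oct | 90.0 | 41.5 | 682.6 |
| sep | 92.18421 | 125.61316 | 738.88947 |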
