## Data Frames (Part II) Split-Apply-Combine

A common task in computer programming is to apply the same function to many sets of similar rows in a dataframe, and compile the results. 
For example, we could run a randomized controlled trial in different countries.
Within each country we would randomize patients to a treatment group and a control group.
Our goal may be to compare the rate of adverse events between treated and control patients, ignoring the country where they were treated.
A common secondary analysis is to apply the same comparison for each country and investigate if there were any differences. 
The split-apply-combine paradigm would work well here. 

### Split-Apply-Combine
Suppose you have stored a data frame $D$ with columns $c_{1},c_{2},\ldots,c_{n}$. 
In addition to these columns we have one extra column $C$ that contains the values $v_{1},v_{2},\ldots,v_{p}$ that **split** the rows of our data frame. 

We also have a function $f$ that takes as input a dataframe and returns a result.

Our goal is to **apply** our function $f$ to the rows of our data frame that correspond to $C == v_{1}$, then **apply** our function to the rows of our data frame that correspond to $C == v_{2}$, and so on. 
After we gather our results for each subset, we finally would like to **combine** all the results into a single dataframe. 
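
The three steps can be sketched by hand with base R before turning to any package. This is a minimal illustration, not the method used in the rest of this section: the data frame `D`, its column `c1`, and the function `f` are made-up stand-ins for the symbols above.

```r
# Toy data frame: column C holds the splitting values, c1 is a measurement
D = data.frame(C  = c("v1", "v1", "v2", "v2", "v2"),
               c1 = c(1, 3, 10, 20, 30))

# f takes a data frame and returns a one-row data frame
f = function(d){
    return( data.frame("mean_c1" = mean(d$c1)) )
}

pieces   = split(D, D$C)            # split: a named list with one data frame per value of C
results  = lapply(pieces, f)        # apply: run f on each piece
combined = do.call(rbind, results)  # combine: stack the one-row results into one data frame
combined
```

Each row of `combined` is labelled by the value of `C` that produced it, so the result holds one summary per subset.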

R has a natural way to perform this operation. 

But first a dataset. 

In [3]:
doctorVisits = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
head(doctorVisits)


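| | X | visits | gender | age | income | illness | reduced | health | private | freepoor | freerepat | nchronic | lchronic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 1 | 1 | 1 | female | 0.19 | 0.55 | 1 | 4 | 1 | yes | no | no | no | no |
| 2 | 2 | 1 | female | 0.19 | 0.45 | 1 | 2 | 1 | yes | no | no | no | no |
| 3 | 3 | 1 | male | 0.19 | 0.9 | 3 | 0 | 0 | no | no | no | no | no |
| 4 | 4 | 1 | male | 0.19 | 0.15 | 1 | 0 | 0 | no | no | no | no | no |
| 5 | 5 | 1 | male | 0.19 | 0.45 | 2 | 5 | 1 | no | no | no | yes | no |
| 6 | 6 | 1 | female | 0.19 | 0.35 | 5 | 1 | 9 | no | no | no | yes | no |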


The description for this data set reads as follows "The sample consists of 5,190 observations and is from the 1977-78 Australian Health Survey and contains information on health service utilization and covariates describing factors that affect health care utilization propensities." 

A description of what each column represents can be found [here](https://vincentarelbundock.github.io/Rdatasets/doc/AER/DoctorVisits.html).

We see that each row represents a single patient from the Australian Health Survey. 
The survey collected information about each patient's gender, number of doctor visits in the past two weeks, annual income divided by 10,000, whether the patient has private health insurance, the number of illnesses in the past two weeks, and so on.

Suppose we want to compute the average number of illnesses in the past two weeks among patients with and without private health insurance. 
Our first step is to declare a function that takes as input a dataframe and returns a data frame that contains the average number of illnesses. 

We will call our function ``averageIllness``. 

In [17]:
averageIllness = function(d){       # the argument d is a dataframe
    mean_illness = mean(d$illness)  # select the column illness and compute the mean
    return( data.frame( "average_illness" = mean_illness) ) # return a data frame with this computation
}

We can apply our function to the entire Australian Health Survey to compute the average number of illnesses reported by patients in the past two weeks.

In [18]:
averageIllness(doctorVisits)

average_illness
<dbl>
1.431985


However, we wanted to compute the average number of illnesses separately for those with private health insurance (rows where the column ``private`` equals ``"yes"``) and for those without private health insurance (rows where ``private`` equals ``"no"``). 

A more laborious way to compute the mean number of illnesses for each value of the variable ``private`` is to subset our data frame by ``private`` and then apply our function to each subset. 

In [19]:
privateSubset = doctorVisits[ doctorVisits$private=="yes",] # logical index to select rows with private insurance
averageIllness(privateSubset)

no_privateSubset = doctorVisits[ doctorVisits$private=="no",] # logical index to select rows without private insurance
averageIllness(no_privateSubset)

average_illness
<dbl>
1.359443


average_illness
<dbl>
1.489627


But we can compute these two numbers more efficiently by using split-apply-combine. 

### Split-apply-combine requires a package

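A package is a collection of functions, often bundled with documentation and data sets, that other developers have written and shared.
Packages are often available for download and installation; in R, ``install.packages("plyr")`` downloads and installs the package used below. 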

The package we will need is called ``plyr``. 
The ``plyr`` package includes functions that allow us to split-apply-combine.

There are two ways to load the functions of an installed package in R. 
One way uses the ``library`` command and the second way uses the ``require`` command. 
Both methods are fine to use. 
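
The two commands differ mainly in what happens when the package is missing: ``library`` stops with an error, while ``require`` returns ``FALSE`` and issues a warning. The sketch below demonstrates this with a deliberately made-up package name (``notARealPackage123`` is an assumption chosen so that the package does not exist), so it runs without installing anything.

```r
# A package name that is assumed NOT to exist, used only for illustration
pkg = "notARealPackage123"

# require() returns FALSE (with a warning) when the package cannot be loaded
loaded = suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
loaded

# library() raises an error instead, which we catch here so the script keeps running
result = tryCatch({
    library(pkg, character.only = TRUE)
    "library() succeeded"
}, error = function(e) "library() raised an error")
result
```

Because ``require`` reports failure through its return value, it is convenient inside an ``if`` statement, whereas ``library`` makes a missing package impossible to overlook.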

In [20]:
require(plyr)

When the above ``require(plyr)`` is executed by R, all the functions that are inside of the ``plyr`` package are available for you to use. 

### ddply
The ``ddply`` function takes three arguments: the data frame you wish to operate on, a variable or set of variables used to split your data frame into subsets, and a function that will be applied to each subset. 

The second argument, the variables to split our data frame by, must be wrapped in ``.()``, for example ``.(var1,var2,var3)``. 

Let's look at our above example. 
We could supply the function ``ddply`` with the data frame ``doctorVisits``, the argument ``.(private)`` to split the data frame into the rows where ``private == "yes"`` and the rows where ``private == "no"``, and our function to compute the mean number of illnesses in the past two weeks. 

In [21]:
ddply(doctorVisits, .(private), averageIllness )

| private | average_illness |
|---|---|
| `<chr>` | `<dbl>` |
| no | 1.489627 |
| yes | 1.359443 |


Above, the data frame ``doctorVisits`` was split into two data frames: one data frame $(D_{1})$ where the variable ``private == "yes"`` and a second data frame $(D_{2})$ where ``private == "no"``.
Then the function ``averageIllness`` is applied to $D_{1}$ and $D_{2}$, and the results are combined into a single data frame.
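
To check this behaviour by hand, the same call can be run on a data frame small enough to verify mentally. The sketch below assumes ``plyr`` is already installed; the data frame ``toy`` and the extra column ``n_patients`` are made up for illustration and are not part of the survey data.

```r
library(plyr)  # assumes the plyr package is already installed

# Toy data: two patients with private insurance, three without
toy = data.frame(private = c("yes", "yes", "no", "no", "no"),
                 illness = c(1, 3, 2, 4, 6))

# The function given to ddply may return several columns for each subset
by_private = ddply(toy, .(private), function(d){
    data.frame("average_illness" = mean(d$illness),  # mean within this subset
               "n_patients"      = nrow(d))          # number of rows in this subset
})
by_private
```

The result has one row per value of ``private``, and ``ddply`` adds the splitting variable as the first column so each summary stays labelled.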

We can provide more than one variable to split our dataframe. 
For example, we could split (or stratify) by whether patients have private health insurance and by the number of days in the past two weeks that the patient had reduced activity (the ``reduced`` variable). 

In [25]:
results = ddply(doctorVisits, .(private,reduced), averageIllness )
results

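| private | reduced | average_illness |
|---|---|---|
| `<chr>` | `<int>` | `<dbl>` |
| no | 0 | 1.343386 |
| no | 1 | 2.011628 |
| no | 2 | 2.327273 |
| no | 3 | 2.666667 |
| no | 4 | 2.285714 |
| no | 5 | 2.619048 |
| no | 6 | 2.461538 |
| no | 7 | 2.631579 |
| no | 8 | 2.461538 |
| no | 9 | 3.25 |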
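
Splitting on two variables means forming one subset for every combination of their values that occurs in the data. As a rough base R sketch of that idea (not how ``ddply`` is implemented), ``split`` accepts a list of variables; the data frame ``toy`` below is made up so the group means can be checked by hand.

```r
# Toy data with two splitting variables
toy = data.frame(private = c("no", "no", "no", "yes", "yes", "yes"),
                 reduced = c(0, 0, 1, 0, 1, 1),
                 illness = c(1, 3, 5, 2, 4, 6))

# Same function as in the text: mean of the illness column as a one-row data frame
averageIllness = function(d){
    return( data.frame("average_illness" = mean(d$illness)) )
}

# One piece per observed combination; drop = TRUE discards combinations with no rows
pieces = split(toy, list(toy$private, toy$reduced), drop = TRUE)
names(pieces)   # labels of the form "private.reduced"

combined = do.call(rbind, lapply(pieces, averageIllness))
combined
```

Unlike ``ddply``, this version records the combination only in the row names, which is one reason the ``plyr`` output is easier to work with afterwards.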


The ``split-apply-combine`` paradigm is a powerful way to apply a function to many different subsets of a dataframe and simplifies our code. 

### Assignment:

Remember that the first step is always to understand the data you are working with. Look [here](https://vincentarelbundock.github.io/Rdatasets/doc/AER/DoctorVisits.html) for more information about the columns in the data.

1. Write a function called ``outline`` that takes as an argument a dataframe
    - The function will compute the sample mean, variance, and standard deviation for the variable age. 
    - The function should return a data frame with one column for each of these three statistics. 
2. Apply our ``outline`` function to the Australian Health Survey (AHS). 
3. Apply our ``outline`` function to patients in the AHS with and without private health insurance. 
4. Apply our ``outline`` function to patients in the AHS for all combinations of presence and absence of private health insurance (``private``) and number of illnesses in the past two weeks (``illness`` variable). 