# **Executing R code distributed through Dask**

Dask can distribute the execution of R code across a cluster.
In this example, each worker runs the same R code on a different part of the data.
Dask does not parallelize the R algorithms themselves: the R code runs in isolation within its worker's thread.

This example is an extension of the [Dask Dataframe example](https://github.com/dask/dask-examples/blob/master/dataframe.ipynb).
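
The execution model described above (the same code applied to different pieces of data, with no parallelism inside the code itself) can be illustrated without Dask or R. The following is only an analogy using Python's standard library: `summarize` is a hypothetical stand-in for the R code, and the thread pool stands in for the Dask workers.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def summarize(chunk):
    """Hypothetical stand-in for the R code: runs serially on one chunk."""
    return mean(chunk)


# Made-up data, already split into independent chunks (one per task).
chunks = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]

# Every task runs the same function on a different chunk.
# The function itself is not parallelized; only the chunks are spread out.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(summarize, chunks))

print(results)
```

The speed-up in such a setup comes only from processing several chunks at once, which is also the case for the R code in this example.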

### **Requirements for each worker:**
- [rpy2](https://rpy2.bitbucket.io/)
- [R](https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2) 


### **Install rpy2**
**NOTE:** Make sure that R is already installed and available; otherwise this will not work properly.
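
One way to check for R before installing rpy2 is to look for the `R` executable on the `PATH`. This is a minimal sketch using only the standard library; the helper name `r_is_available` is hypothetical, and a `PATH` lookup is only a heuristic, since R may be installed in a location that is not on the `PATH`.

```python
import shutil
import subprocess


def r_is_available():
    """Return True if an `R` executable is on PATH and answers `--version`."""
    r_path = shutil.which("R")
    if r_path is None:
        return False
    try:
        completed = subprocess.run(
            [r_path, "--version"], capture_output=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


print("R found:", r_is_available())
```

On a cluster, the same check would have to pass on every worker, not only on the machine running the notebook.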

In [None]:
!pip install rpy2

#### **Imports**

In [1]:
import pandas as pd
import numpy as np

import dask.dataframe as dd
from dask.distributed import Client

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

#### **Create dummy data**

We create a random timeseries of data with the following attributes:

- It stores a record for every 1 hour of the year 2000
- It splits that year by month, keeping every month as a separate Pandas dataframe
- Along with a datetime index it has columns for names, ids, and numeric values


In [2]:
df = dd.demo.make_timeseries('2000-01-01', '2000-12-31', freq='1h', partition_freq='1M',
                             dtypes={'name': str, 'id': int, 'x': float, 'y': float})
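
To make the partitioning concrete, the sketch below builds hourly timestamps for the same date range with the standard library and groups them by month. It only illustrates the layout described above and is not how Dask generates the data; in particular, treating the end date as exclusive is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta

start = datetime(2000, 1, 1)
end = datetime(2000, 12, 31)  # treated as exclusive here (an assumption)

# One bucket per month, mirroring "every month as a separate dataframe".
partitions = defaultdict(list)
current = start
while current < end:
    partitions[current.month].append(current)
    current += timedelta(hours=1)

print(len(partitions), "partitions")
print(len(partitions[1]), "hourly records in January")
```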



#### **Create the callable function**
The function below wraps the R code. It acts like a regular Python function, but it calls the R interpreter through rpy2. Keep in mind that the R code is expected to stay single-threaded; multi-threaded R code might behave unexpectedly.

In [3]:
def perform_r_ops(df):
    # R function that receives a data frame and returns the mean of column x.
    rstring = """
        function(df) {
            df_slice <- df$x
            mean(df_slice)    # Value returned to Python; a simple calculation for illustration.
        }
    """
    pandas2ri.activate()             # Activate the pandas conversion for each worker through the function
    rfunc = ro.r(rstring)            # Create a callable R function object
    rdf = pandas2ri.DataFrame(df)    # Convert the pandas dataframe to an R dataframe
    result = rfunc(rdf)              # Execute the R code with the R dataframe as argument
    return result                    # Return the output. For more info about extracting see: http://rpy2.readthedocs.io/en/version_2.8.x/vector.html#extracting-python-style

#### **Groupby Apply with rpy2**

In [4]:
df.groupby('name').apply(perform_r_ops, meta=object).compute()

name
Alice       [-0.020649483951557596]
Dan         [-0.010924596022318636]
Jerry         [0.04493333142680164]
Tim          [0.016334682264496882]
Hannah        [0.03049909519252575]
Quinn        [0.020877898211815096]
Ray          [-0.03867841478430033]
George       [-0.01505946773820897]
Yvonne       [0.025490626157857264]
Bob          [0.003996973040876134]
Norbert      [0.030308191014517465]
Victor       [0.009046161042206964]
Ingrid      [-0.043327977398960986]
Sarah         [0.03273205715961313]
Wendy        [-0.02527674242920192]
Frank        [-0.04919060194073418]
Laura        [0.029971801911119395]
Patricia    [-0.007046741838683191]
Ursula      [-0.008313532294525718]
Charlie       [0.04298031615209613]
Edith         [0.03715192411380558]
Kevin        [0.033735930718072585]
Xavier        [0.04024388947201336]
Michael      [0.016625469170806047]
Zelda        [-0.04460181790592972]
Oliver       [0.029763681682863113]
dtype: object
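
The groupby-apply step follows a split-apply-combine pattern: rows are split by `name`, the function is applied to each group, and the per-group results are combined. The following plain-Python sketch shows that pattern with made-up rows; `mean_of_x` is a hypothetical stand-in for `perform_r_ops` and does not call R.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows standing in for the timeseries data.
rows = [
    {"name": "Alice", "x": 0.5},
    {"name": "Bob", "x": -1.0},
    {"name": "Alice", "x": 1.5},
    {"name": "Bob", "x": 2.0},
]


def mean_of_x(group_rows):
    """Hypothetical stand-in for perform_r_ops: mean of column x."""
    return mean(row["x"] for row in group_rows)


# Split: collect the rows for each name.
groups = defaultdict(list)
for row in rows:
    groups[row["name"]].append(row)

# Apply and combine: one result per group.
result = {name: mean_of_x(group) for name, group in groups.items()}
print(result)
```

Unlike this sketch, the R function returns a one-element R vector per group, which is why each value in the output above is shown in brackets.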