# **Parallelize R code with Dask**

Dask can be used to distribute the execution of R code across a cluster.
In this example, each worker receives the same R code and runs it on different data.
Dask does not parallelize the R algorithms or any of the R code itself: the R code runs in isolation, inside the thread of the worker that executes it.

This example is an extension of the [Dask Dataframe example](https://github.com/dask/dask-examples/blob/master/dataframe.ipynb).

### **Requirements for each worker:**
- [rpy2](https://rpy2.bitbucket.io/)
- [R](https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2) 


### **Install rpy2**
**NOTE:** Make sure that R is already installed and available; otherwise the rpy2 installation will not work properly.
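
One way to check for R before installing rpy2 is to look for the `R` launcher on the `PATH`. This is a minimal sketch using only the standard library. The helper name `find_r_executable` is hypothetical, and a `PATH` lookup is only a heuristic, since rpy2 can also locate R through the `R_HOME` environment variable.

```python
import shutil


def find_r_executable():
    """Return the path of the R launcher found on PATH, or None if there is none."""
    return shutil.which("R")


r_path = find_r_executable()
if r_path is None:
    print("R was not found on PATH; install R before installing rpy2.")
else:
    print(f"Found R at {r_path}")
```

Because every worker needs R, the same check would have to pass on each worker machine, not only on the machine running the notebook.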

In [None]:
!pip install rpy2

#### **Imports**

In [1]:
import pandas as pd

import dask.dataframe as dd

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

#### **Create dummy data**

We create a random timeseries of data with the following attributes:

- It stores a record for every 1 hour of the year 2000
- It splits that year by month, keeping every month as a separate Pandas dataframe
- Along with a datetime index it has columns for names, ids, and numeric values
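
To get a feel for the size of such a dataset, the hourly index and its monthly blocks can be reproduced with pandas alone. This is only an illustration, assuming `pandas` is installed. The names `hourly_index` and `month_count` are hypothetical, and the exact partition boundaries that Dask chooses may differ from this calendar-month count.

```python
import pandas as pd

# One timestamp per hour between the two dates, both endpoints included.
hourly_index = pd.date_range("2000-01-01", "2000-12-31", freq="1h")

# Number of distinct calendar months covered by the index.
month_count = hourly_index.to_period("M").nunique()

print(len(hourly_index), "hourly records across", month_count, "months")
```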


In [2]:
df = dd.demo.make_timeseries('2000-01-01', '2000-12-31', freq='1h', partition_freq='1M',
                             dtypes={'name': str, 'id': int, 'x': float, 'y': float})

#### **Create the callable function**
The function below wraps the R code. It behaves like a regular Python function, but it calls the R interpreter through rpy2. Keep the R code single-threaded: multi-threaded R code might behave unexpectedly when it runs inside a worker.

In [3]:
def perform_r_ops(df):
    # R source of an anonymous function; the value of its last expression
    # is what gets returned to Python.
    rstring = """
        function(df) {
            df_slice <- df$x
            mean(df_slice)    # A simple calculation for illustration purposes
        }
    """
    pandas2ri.activate()             # Activate the pandas conversion on each worker through the function
    rfunc = ro.r(rstring)            # Evaluate the R source into a callable R function
    rdf = pandas2ri.DataFrame(df)    # Convert the pandas dataframe to an R dataframe
    result = rfunc(rdf)              # Execute the R code with the R dataframe as argument
    return result                    # Return the output (a length-one vector, not a plain float)

### **Extracting the return values of rpy2**

rpy2 returns its own object types, and their values can be accessed in several ways.
For more information about extracting values, see: http://rpy2.readthedocs.io/en/version_2.8.x/vector.html#extracting-python-style
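
Since the R function here returns a single mean, the simplest extraction is Python-style indexing of the first element. This is a minimal sketch. The helper `first_value` is hypothetical, and a plain list is used as a stand-in for the returned vector so that the snippet runs without R. It relies only on `len()` and zero-based indexing, which the linked page describes for rpy2 vectors.

```python
def first_value(vector):
    """Return the single element of a length-one, sequence-like result as a float."""
    if len(vector) != 1:
        raise ValueError(f"expected exactly one value, got {len(vector)}")
    return float(vector[0])


# Stand-in for the length-one vector returned by the R function.
stand_in_result = [0.0012763385464659885]
scalar = first_value(stand_in_result)
print(type(scalar).__name__, scalar)
```

If the wrapped function returned such a plain float, the `meta` argument passed to Dask could probably describe a float result (for example `meta=('x', 'f8')`) instead of the generic `object`.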

#### **Groupby-apply with rpy2**

In [4]:
df.groupby('name').apply(perform_r_ops, meta=object).compute()

name
Alice       [0.0012763385464659885]
Dan          [0.010587556462828133]
Jerry       [-0.023687313832477828]
Tim         [-0.006049016361787574]
Hannah       [-0.03319578058233784]
Quinn         [0.00455131760669532]
Ray         [-0.005108295691014426]
George        [-0.0508185484311886]
Yvonne      [-0.009431459192043835]
Bob           [-0.0491188247223946]
Norbert       [0.03936579716800174]
Victor       [0.004495791937725107]
Ingrid       [0.008424909189591887]
Sarah        [0.001449296236466594]
Wendy        [-0.03691536377249662]
Frank         [0.01254856742189228]
Laura       [-0.003451640891369274]
Patricia     [-0.03478453625919576]
Ursula        [0.07816982991077753]
Charlie       [0.05101962577097287]
Edith        [0.017001791503198777]
Kevin        [-0.07798130286562137]
Xavier       [-0.03734307137742184]
Michael      [0.010179924320114647]
Zelda        [-0.03624956266976925]
Oliver       [0.013192828490722256]
dtype: object
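
Because the R function only computes the mean of `x`, its output can be cross-checked against a native group mean. This is a minimal sketch on a small hypothetical sample, assuming `pandas` is installed. The values are made up for illustration and are not taken from the run above.

```python
import pandas as pd

# Hypothetical sample with the same 'name' and 'x' columns as the demo data.
sample = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Alice", "Bob"],
        "x": [0.5, -1.0, 1.5, 2.0],
    }
)

# Native equivalent of applying mean(df$x) to every 'name' group.
expected = sample.groupby("name")["x"].mean()
print(expected)
```

On the Dask dataframe itself, the comparable native call would be `df.groupby('name').x.mean().compute()`, whose values should agree with the R results up to floating-point rounding.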