# How to work with R

* **Difficulty level**: easy
* **Time needed to learn**: 15 minutes or less
* **Key points**:
  * Most Python (SoS) data types have an intuitive counterpart in R, and vice versa

## Installation

There are several ways to install `R` and its Jupyter kernel [IRkernel](https://github.com/IRkernel/IRkernel). The easiest is probably `conda`, but installing third-party R libraries into a conda environment can be tricky, and mixing R packages from the `base` and `r` channels can lead to serious problems.

Once you have a working R installation with `IRkernel` installed, you will need to install:

* The `sos-r` language module,
* The `feather` library of R, and
* The `feather-format` module of Python

The feather modules are needed to exchange dataframes between Python and R.

## Overview

SoS converts Python variables of the following types to R as follows:

  
  | Python  |  condition |   R |
  | --- | --- |---|
  | `None` | &nbsp; |    `NULL` |
  | `integer` | &nbsp; |  `integer` |
  | `integer` | `large` | `numeric` |
  | `float` | &nbsp; |  `numeric` |
  | `boolean` | &nbsp;  | `logical` |
  | `complex` |&nbsp;  |  `complex` |
  | `str` | &nbsp; | `character` |
  | Sequence (`list`, `tuple`, ...) |  homogeneous type |  `c()` |
  | Sequence (`list`, `tuple`, ...) |  multiple types |  `list` |
  | `set` | &nbsp; |  `list` |
  | `dict` | &nbsp; |  `list` with names |
  | `numpy.ndarray` | &nbsp; | array |
  | `numpy.matrix` | &nbsp; | `matrix` |
  | `pandas.DataFrame` |&nbsp;  |  R `data.frame` |
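
The rows for built-in types can be read as a small decision procedure. The sketch below only mirrors the table; it is not the converter SoS uses. The helper `r_type_for` and the constant `R_INT_MAX` are hypothetical names, and treating "large" as "outside R's 32-bit integer range" is an assumption, since the table does not state the threshold.

```python
# Illustrative only: mirrors the table above, not SoS's actual converter.

# Assumption: "large" means outside the range of R's 32-bit integer type.
R_INT_MAX = 2**31 - 1

SCALAR_TYPES = (bool, int, float, complex, str)


def r_type_for(value):
    """Return the R type that the table above lists for a Python value."""
    if value is None:
        return "NULL"
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "logical"
    if isinstance(value, int):
        return "integer" if abs(value) <= R_INT_MAX else "numeric"
    if isinstance(value, float):
        return "numeric"
    if isinstance(value, complex):
        return "complex"
    if isinstance(value, str):
        return "character"
    if isinstance(value, dict):
        return "list with names"
    if isinstance(value, set):
        return "list"
    if isinstance(value, (list, tuple)):
        # A sequence of scalars of one type becomes a vector built with c();
        # anything else falls back to an R list.
        kinds = {type(item) for item in value}
        if len(kinds) == 1 and next(iter(kinds)) in SCALAR_TYPES:
            return "c()"
        return "list"
    raise TypeError(f"no rule in this sketch for {type(value).__name__}")


for sample in (None, 123, 123456789123456789, True, ['1', '2', '3'], [1, 2, '3']):
    print(repr(sample), '->', r_type_for(sample))
```

The `numpy` and `pandas` rows are left out so that the sketch needs only the standard library.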



SoS converts R variables of the following types to Python (SoS) as follows (`n` in the `condition` column is the length of the R object):
  
  | R  |  condition  |   Python |
  | --- | --- |---|
  | `NULL` | &nbsp;|    `None` |
  | `logical` |  `n == 1` |  `boolean` |
  | `integer` |  `n == 1` |  `integer` |
  | `numeric` |  `n == 1` |  `float` |
  | `character` |  `n == 1` |  `str` |
  | `complex` |  `n == 1` |  `complex` |
  | `logical` |  `n > 1` |  `list` |
  | `integer` |  `n > 1` |  `list` |
  | `complex` |  `n > 1` |  `list` |
  | `numeric` |  `n > 1` |  `list` |
  | `character` |  `n > 1` |  `list` |
  | `list` without names | &nbsp;  | `list` |
  | `list` with names | &nbsp;  |  `dict` (with ordered keys)|
  | `matrix` | &nbsp;  |  `numpy.ndarray` |
  | `data.frame` | &nbsp; |  `pandas.DataFrame` |
  | `array` | &nbsp;  |  `numpy.ndarray` |

One of the key problems in mapping R datatypes to Python is that R does not have scalar types: every scalar variable is actually a vector of length 1. In theory, the R variable `a=1` should therefore be represented in Python as `a=[1]`. However, because Python does differentiate between scalar and array values, we chose to represent R vectors of length 1 as scalar types in Python.

In [2]:
%put a b
a = c(1)
b = c(1, 2)

In [3]:
print(f'a={a} with type {type(a)}')
print(f'b={b} with type {type(b)}')

a=1 with type <class 'int'>
b=[1, 2] with type <class 'list'>
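
The length-1 rule can be sketched in plain Python. The helper `from_r_vector` is a hypothetical name used only for illustration, and Python lists stand in for R vectors; this is not the code SoS runs.

```python
def from_r_vector(values):
    """Collapse a length-1 vector to a scalar; keep longer vectors as lists."""
    values = list(values)
    return values[0] if len(values) == 1 else values


a = from_r_vector([1])     # stands in for R's c(1)
b = from_r_vector([1, 2])  # stands in for R's c(1, 2)

print(f'a={a} with type {type(a)}')
print(f'b={b} with type {type(b)}')
```

One consequence of this choice is that code receiving a vector whose length varies between runs may get either a scalar or a list, so it can be worth checking the type before iterating.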


## Simple data types

Most simple Python data types can be converted to R types easily:

In [5]:
null_var = None
int_var = 123
float_var = 3.1415925
logic_var = True
char_var = '1"23'
comp_var = 1+2j

In [6]:
%get null_var int_var float_var logic_var char_var comp_var
%preview -n null_var int_var float_var logic_var char_var comp_var

 NULL


 num 123


 num 3.14


 logi TRUE


 chr "1\"23"


 cplx 1+2i


The variables can be sent back to SoS without losing information:

In [7]:
%get null_var int_var float_var logic_var char_var comp_var --from R
%preview -n null_var int_var float_var logic_var char_var comp_var

None

123

3.1415925

True

'1"23'

(1+2j)

However, Python allows integers of arbitrary precision, which R does not support. Large integers are therefore represented in R as floating-point numbers, which might not keep the full precision of the original number.

For example, if we send a large integer with 18 significant digits to R:

In [8]:
%put large_int --to R
large_int = 123456789123456789

The last digit differs because of the floating-point representation:

In [9]:
%put large_int
large_int

This is not a problem with SoS, because you would get the same result if you entered this number directly in R:

In [12]:
123456789123456789

Consequently, if you send `large_int` back to `SoS`, the number differs from the original:

In [13]:
%get large_int --from R
large_int

123456789123456784
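
The same loss can be reproduced without R, because R's `numeric` type and Python's `float` are both IEEE 754 double-precision numbers. The following sketch converts the integer to a `float` and back; the variable names are chosen for illustration.

```python
large_int = 123456789123456789

# A double has a 53-bit significand, so not every 18-digit integer fits exactly.
as_double = float(large_int)
round_trip = int(as_double)

print(round_trip)               # 123456789123456784
print(round_trip == large_int)  # False

# Integers up to 2**53 in magnitude survive the conversion unchanged.
print(int(float(2**53)) == 2**53)  # True
```

If an exact value matters, one option is to transfer the integer as a string and convert it on the receiving side.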

## Array, matrix, and dataframe

One-dimensional (vector) data is converted from SoS to R as follows:

In [3]:
import numpy
import pandas
char_arr_var = ['1', '2', '3']
list_var = [1, 2, '3']
dict_var = dict(a=1, b=2, c='3')
set_var = {1, 2, '3'}
recursive_var = {'a': {'b': 123}, 'c': True}
logic_arr_var = [True, False, True]
seri_var = pandas.Series([1,2,3,3,3,3])

In [4]:
%get char_arr_var list_var dict_var set_var recursive_var logic_arr_var seri_var
%preview -n char_arr_var list_var dict_var set_var recursive_var logic_arr_var seri_var

Multi-dimensional data is converted from SoS to R as follows:

In [5]:
num_arr_var = numpy.array([1, 2, 3, 4]).reshape(2,2)
mat_var = numpy.matrix([[1,2],[3,4]])

In [6]:
%get num_arr_var mat_var
%preview -n num_arr_var mat_var

0,1
1,2
3,4


0,1
1,2
3,4


Scalar data is converted from R to SoS as follows:

In [7]:
null_var = NULL
num_var = 123
logic_var = TRUE
char_var = '1\"23'
comp_var = 1+2i

In [8]:
%get null_var num_var logic_var char_var comp_var --from R
%preview -n null_var num_var logic_var char_var comp_var

None

123

True

'1"23'

(1+2j)

One-dimensional (vector) data is converted from R to SoS as follows:

In [9]:
num_vector_var = c(1, 2, 3)
logic_vector_var = c(TRUE, FALSE, TRUE)
char_vector_var = c(1, 2, '3')
list_var = list(1, 2, '3')
named_list_var = list(a=1, b=2, c='3')
recursive_var = list(a=1, b=list(c=3, d='whatever'))
seri_var = setNames(c(1,2,3,3,3,3),c(0:5))

In [10]:
%get num_vector_var logic_vector_var char_vector_var list_var named_list_var recursive_var seri_var --from R
%preview -n num_vector_var logic_vector_var char_vector_var list_var named_list_var recursive_var seri_var

[1, 2, 3]

[True, False, True]

['1', '2', '3']

[1, 2, '3']

{'a': 1, 'b': 2, 'c': '3'}

{'a': 1, 'b': {'c': 3, 'd': 'whatever'}}

0    1
1    2
2    3
3    3
4    3
5    3
dtype: int64

Multi-dimensional data is converted from R to SoS as follows:

In [11]:
mat_var = matrix(c(1,2,3,4), nrow=2)
arr_var = array(c(1:16),dim=c(2,2,2,2))

In [12]:
%get mat_var arr_var --from R
%preview -n mat_var arr_var

array([[ 1.,  3.],
       [ 2.,  4.]])

array([[[[ 1,  3],
         [ 2,  4]],

        [[ 5,  7],
         [ 6,  8]]],


       [[[ 9, 11],
         [10, 12]],

        [[13, 15],
         [14, 16]]]])
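
The layout of `mat_var` follows from R filling matrices column by column. A minimal pure-Python sketch of that fill order is shown below; `fill_column_major` is a hypothetical helper, and unlike R's `matrix()` it rejects input whose length is not a multiple of `nrow` instead of recycling values.

```python
def fill_column_major(values, nrow):
    """Arrange a flat sequence into rows, filling column by column."""
    values = list(values)
    if len(values) % nrow != 0:
        raise ValueError("length of values must be a multiple of nrow")
    ncol = len(values) // nrow
    # Element (row, col) sits at flat index col * nrow + row.
    return [[values[col * nrow + row] for col in range(ncol)]
            for row in range(nrow)]


print(fill_column_major([1, 2, 3, 4], nrow=2))  # [[1, 3], [2, 4]]
```

With `numpy`, the same ordering is available through `order='F'` in `reshape`.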

It is worth noting that R's named `list` is transferred to Python as a dictionary, and SoS preserves the order of the keys so that you can recover the order of the list. For example,

In [13]:
Rlist = list(A=1, C='C', B=3, D=c(2, 3))

Although the dictionary might appear to have a different order when displayed

In [14]:
%get Rlist --from R
Rlist

{'A': 1, 'B': 3, 'C': 'C', 'D': [2, 3]}

The order of the keys and values is actually preserved

In [15]:
Rlist.keys()

dict_keys(['A', 'C', 'B', 'D'])

In [16]:
Rlist.values()

dict_values([1, 'C', 3, [2, 3]])

so it is safe to enumerate the R list in Python as

In [17]:
for idx, (key, val) in enumerate(Rlist.items()):
  print(f"{idx+1} item of Rlist has key {key} and value {val}")

1 item of Rlist has key A and value 1
2 item of Rlist has key C and value C
3 item of Rlist has key B and value 3
4 item of Rlist has key D and value [2, 3]
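
A displayed dictionary can look sorted even though its stored order is intact. The sketch below shows this with the standard `pprint` module, which sorts keys by default; whether the notebook display relies on the same mechanism is an assumption. The `sort_dicts` argument requires Python 3.8 or later.

```python
import pprint

# Same keys and insertion order as Rlist in the example above.
r_list = {'A': 1, 'C': 'C', 'B': 3, 'D': [2, 3]}

# Iteration follows insertion order.
print(list(r_list))  # ['A', 'C', 'B', 'D']

# Pretty-printing sorts the keys by default, which changes only the display.
print(pprint.pformat(r_list))  # {'A': 1, 'B': 3, 'C': 'C', 'D': [2, 3]}

# Turning the sorting off shows the stored order.
print(pprint.pformat(r_list, sort_dicts=False))
```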


## Set, dictionary (mapping), and nested datatypes

## Further reading

* [Exchanging data among kernels](exchange_variable.html)