ryp is a minimalist, powerful Python library for:
- running R code inside Python
- quickly transferring huge datasets between Python (NumPy/pandas/polars) and R without writing to disk
- interactively working in both languages at the same time
ryp is an alternative to the widely used rpy2 library. Compared to rpy2, ryp provides:
- increased stability
- a much simpler API, with less of a learning curve
- interactive printouts of R variables that match what you'd see in R
- a full-featured R terminal inside Python for interactive work
- inline plotting in Jupyter notebooks (requires the
svglite
R package) - much faster data conversion with Arrow (also provided by rpy2-arrow)
- support for every NumPy, pandas and polars data type representable in base R, no matter how obscure
- support for sparse arrays/matrices
- recursive conversion of containers like R lists, Python tuples/lists/dicts, and S3/S4/R6 objects
- full Windows support
ryp does the opposite of the reticulate R library, which runs Python inside R.
Install ryp via pip:
pip install ryp
conda:
conda install ryp
or mamba:
mamba install ryp
Or, install the development version via pip:
pip install git+https://github.com/Wainberg/ryp
ryp's only mandatory dependencies are:
- Python 3.7+
- R
- the cffi Python package
- the pyarrow Python package, which includes NumPy as a dependency
- the arrow R library
R and the arrow R library are automatically installed when installing ryp via
conda or mamba, but not via pip. ryp uses the R installation pointed to by the
environment variable R_HOME
, or if R_HOME
is not defined or not a
directory, by running R RHOME
through subprocess.run()
.
ryp also has several optional dependencies, which are not installed automatically with pip, conda or mamba. These are:
- pandas, for
format='pandas'
- polars, for
format='polars'
- SciPy and the Matrix R library, for sparse matrices
- the svglite R library, for inline plotting in Jupyter notebooks
ryp consists of just three core functions:
r(R_code)
runs a string of R code.r()
with no arguments opens up an R terminal inside your Python terminal for interactive work.to_r(python_object, R_variable_name)
converts a Python object into an R object namedR_variable_name
.to_py(R_statement)
converts the R object produced by evaluatingR_statement
to Python.R_statement
may be a single variable name, or a more complex code snippet that evaluates to the R object you'd like to convert.
and two more functions, get_config()
and set_config()
, for getting and
setting ryp's global configuration options.
r(R_code: str = ...) -> None
r(R_code)
runs a string of R code inside ryp's R interpreter, which is
embedded inside Python. It can contain multiple statements separated by
semicolons or newlines (e.g. within a triple-quoted Python string). It returns
None
; use to_py()
instead if you would like to convert the result back to
Python.
r()
with no arguments opens up an R terminal inside your Python terminal
for interactive debugging. Press Ctrl + D
to exit back to the Python
terminal. R variables defined from Python will be available in the R terminal,
and variables defined in the R terminal will be available from Python once you
exit:
>>> from ryp import r
>>> r('a = 1')
>>> r()
> a
[1]
1
> b <- 2
>
>>> r('b')
[1]
2
Note that the default value for R_code
is the special sentinel value ...
(Ellipsis
) rather than None
. This stops users from inadvertently opening
the terminal when passing a variable that is supposed to be a string but is
unexpectedly None
.
to_r(python_object: object, R_variable_name: str, *,
format: Literal['keep', 'matrix', 'data.frame'] | None = None,
rownames: object = None, colnames: object = None) -> None
to_r(python_object, R_variable_name)
converts python_object
to R, adding it
to R's global namespace (globalenv
) as a variable named R_variable_name
.
If python_object
is a container (list
, tuple
, or dict
), to_r()
recursively converts each element and returns an R named list (if
python_object
is a dict
) or unnamed list (if python_object
is a list
or
tuple
).
By default (format='keep'
), ryp converts polars and pandas DataFrames (and
pandas MultiIndexes) into R data frames, and 2D NumPy arrays into R matrices.
Specify format='matrix'
to convert everything (even DataFrames) to R matrices
(in which case all DataFrame columns must have the same type), and
format='data.frame'
to convert everything (even 2D NumPy arrays) to R
data frames.
format
must be None
unless python_object
is a DataFrame, MultiIndex or 2D
NumPy array – or unless python_object
is a list
, tuple
, or dict
, in
which case the format
will apply recursively to any DataFrames, MultiIndexes
or 2D NumPy arrays it contains.
Since NumPy arrays, polars DataFrames and Series, and scipy sparse arrays and
matrices lack row and column names, you can specify these separately via the
rownames
and/or colnames
arguments, and they will be added to the converted
R object. rownames
and colnames
can be lists, tuples, string Series, or
categorical Series with string categories, and will be automatically converted
to R character vectors.
rownames
and colnames
must match the length or shape[1]
, respectively, of
the object being converted. The one exception is that rownames of any length
may be added to a 0 × 0 polars DataFrame, since polars does not have the
concept of an N
× 0 DataFrame for nonzero N
. (Dropping all the
columns of a polars DataFrame always results in a 0 × 0 DataFrame, even
if the original DataFrame had more than 0 rows.)
Because Python bool
, int
, float
, and str
convert to length-1 R vectors
that support names, you can pass length-1 rownames
when converting objects of
these types. You can also pass rownames
and/or colnames
when
python_object
is a list
, tuple
, or dict
, in which case row and column
names will only be added to elements that support them. All elements that
support rownames
must have the same length as the rownames
, and similarly
for the colnames
.
rownames
cannot be specified if python_object
is a pandas Series or
DataFrame (since they already have rownames, i.e. an index), or
bytes
/bytearray
(since these convert to raw
vectors, which lack
rownames). colnames
cannot be specified unless python_object
is a
multidimensional NumPy array or scipy sparse array or matrix, or something that
might contain one (list
, tuple
, or dict
).
to_py(R_statement: str, *,
format: Literal['polars', 'pandas', 'pandas-pyarrow', 'numpy'] |
dict[Literal['vector', 'matrix', 'data.frame'],
Literal['polars', 'pandas', 'pandas-pyarrow',
'numpy']] | None = None,
index: str | Literal[False] | None = None,
squeeze: bool | None = None) -> Any
to_py(R_statement)
runs a single statement of R code (which can be as simple
as a single variable name) and converts the resulting R object to Python.
If the object is a list/S3 object, S4 object, or environment/R6 object, it
recursively converts each attribute/slot/field and returns a Python dict
(or
list
, if the object is an unnamed list). For R6 objects, only public fields
will be converted.
By default, or when format='polars'
, R vectors will be converted to polars
Series, and R data frames and matrices will be converted to polars DataFrames.
You can change this by setting the format
argument to 'pandas'
,
'pandas-pyarrow'
(like 'pandas'
, but converting to pyarrow dtypes wherever
possible) or 'numpy'
. (You can also change the default format, e.g. with
set_config(to_py_format='pandas')
.)
For finer-grained control, you can set format
for only certain R variable
types by specifying a dictionary with 'vector'
, 'matrix'
, and/or
'data.frame'
as keys and 'polars'
, 'pandas'
, 'pandas-pyarrow'
and/or
'numpy'
as values.
format
must be None
when R_statement
evaluates to NULL
, when it
evaluates to an array of 3 or more dimensions (these are always converted to
NumPy arrays), or when the final result would be a Python scalar (see squeeze
below).
By default, the R object's names
or rownames
will become the index (for
pandas) or the first column (for polars) of the output Python object, named
'index'
. Set the index
argument to a different string to change this name,
or set index=False
to not convert the names
/rownames
.
Note that for polars, the output will be a two-column DataFrame (not a Series!)
when the input is an R vector, unless index=False
.
When the output is a NumPy array, names
and rownames
will always be
discarded, since numeric NumPy arrays cannot store string indexes except with
the inefficient dtype=object
.
index
must be None
when format='numpy'
, or when the final result would be
a Python scalar (see squeeze
below).
By default, length-1 R vectors, matrices and arrays will be converted to Python
scalars instead of Python arrays, Series or DataFrames. Set squeeze=False
to
disable this special case. (R data frames are never converted to Python scalars
even if squeeze=True
.)
squeeze
must be None
unless the R object is a vector, matrix or array
(raw
vectors don't count, because they always convert to Python scalars).
set_config(*, to_r_format=None, to_py_format=None, index=None, squeeze=None,
plot_width: int | float | None = None,
plot_height: int | float | None = None) -> None
set_config
sets ryp's configuration settings. Arguments that are None
are
left unchanged.
For instance, to set pandas as the default format in to_py()
, run
set_config(to_py_format='pandas')
.
to_r_format
: the default value for theformat
parameter into_r()
; must be'keep'
(the default),'matrix'
, or'data.frame'
.to_py_format
: the default value for theformat
parameter into_py()
; must be'polars'
(the default),'pandas'
,'pandas-pyarrow'
,'numpy'
, or a dictionary with one of those four Python formats and/orNone
as values and'vector'
,'matrix'
and/or'data.frame'
as keys. If certain keys are missing or haveNone
as the format, leave their format unchanged.index
: the default value for theindex
parameter in to_py(); must be a string (default:'index'
) orFalse
.squeeze
: the default value for thesqueeze
parameter into_py()
; must
beTrue
(the default) orFalse
.plot_width
: the width, in inches, of inline plots in Jupyter notebooks; must be a positive number. Defaults to 6.4 inches, to match Matplotlib's default.plot_height
: the height, in inches, of inline plots in Jupyter notebooks; must be a positive number. Defaults to 4.8 inches, to match Matplotlib's default.
For additional customization, users can specify ryp-specific settings in their
.Rprofile
:
if ("ryp" %in% commandArgs()) {
# Custom settings for running R within ryp
} else {
# Custom settings for native R
}
get_config() -> dict[str, dict[str, str] | str | bool | int]
get_config
returns the current configuration options as a dictionary, with
keys to_r_format
, to_py_format
, index
, squeeze
, plot_width
, and
plot_height
.
Python | R |
---|---|
None |
NULL (if scalar) or NA (if inside NumPy, pandas or polars) |
nan |
NaN (if scalar or inside polars) or NA (if inside NumPy or pandas) |
pd.NA |
NA |
pd.NaT , np.datetime64('NaT') , np.timedelta64('NaT') |
NA |
bool |
length-1 logical vector |
int |
length-1 integer (if abs(x) <= 2_147_483_647 ) or bit64::integer64 vector |
float |
length-1 numeric vector |
str |
length-1 character vector |
complex |
length-1 complex vector |
datetime.date |
length-1 Date vector |
datetime.datetime |
length-1 POSIXct vector |
datetime.timedelta |
length-1 difftime(units='secs') vector |
datetime.time (tzinfo must be None ) |
length-1 hms::hms vector |
bytes , bytearray |
raw vector |
list , tuple |
unnamed list |
dict (all keys must be strings) |
named list |
polars Series, pandas Series*, pandas Index |
vector |
polars DataFrame, pandas DataFrame*, pandas MultiIndex |
matrix† (if format == 'matrix' ; all columns must have same data type) or data.frame |
1D NumPy array | vector |
2D NumPy array | data.frame (if format == 'data.frame' ) or matrix† |
≥ 3D NumPy array | array† |
0D NumPy array (e.g. np.array(1) ), NumPy generic (e.g. np.int32(1) ) |
length-1 vector |
csr_array , csr_matrix |
dgRMatrix (if int or float), lgRMatrix (if boolean), -- (if complex) |
csc_array , csc_matrix |
dgCMatrix (if int or float), lgCMatrix (if boolean), -- (if complex) |
coo_array , coo_matrix |
dgTMatrix (if int or float), lgTMatrix (if boolean), -- (if complex) |
Python | R |
---|---|
bool |
logical |
int8 , uint8 , int16 , uint16 , int32 |
integer |
uint32 , uint64 |
integer (if x <= 2_147_483_647 ) or numeric |
int64 |
integer (if abs(x) <= 2_147_483_647 ) or bit64::integer64 |
float16 , float32 , float64 , float128 |
numeric (note: float128 loses precision) |
complex64 , complex128 |
complex |
bytes (e.g. 'S1' ) |
-- |
str /unicode (e.g. 'U1' ) |
character |
datetime64 |
POSIXct |
timedelta64 |
difftime(units='secs') |
void (unstructured) |
raw |
void (structured) |
-- |
object |
depends on the contents‡ |
Python | R |
---|---|
BooleanDtype |
logical |
Int8Dtype , UInt8Dtype , Int16Dtype , UInt16Dtype , Int32Dtype |
integer |
UInt32Dtype , UInt64Dtype |
integer (if x <= 2_147_483_647 ) or numeric |
Int64Dtype |
integer (if abs(x) <= 2_147_483_647 ) or bit64::integer64 |
Float32Dtype , Float64Dtype |
numeric |
StringDtype |
character |
CategoricalDtype(ordered=False) |
unordered factor |
CategoricalDtype(ordered=True) |
ordered factor |
DatetimeTZDtype , PeriodDtype |
POSIXct |
IntervalDtype , SparseDtype |
-- |
Python | R |
---|---|
pa.bool_ |
logical |
pa.int8 , pa.uint8 , pa.int16 , pa.uint16 , pa.int32 |
integer |
pa.uint32 , pa.uint64 |
integer (if x <= 2_147_483_647 ) or numeric |
pa.int64 |
integer (if abs(x) <= 2_147_483_647 ) or bit64::integer64 |
pa.float32 , pa.float64 |
numeric |
pa.string , pa.large_string |
character |
pa.date32 |
Date |
pa.date64 , pa.timestamp |
POSIXct |
pa.duration |
difftime(units='secs') |
pa.time32 , pa.time64 |
hms::hms |
pa.dictionary(any integer type, pa.string(), ordered=0) |
unordered factor |
pa.dictionary(any integer type, pa.string(), ordered=1) |
ordered factor |
pa.null() |
vctrs::unspecified |
Python | R |
---|---|
Boolean |
logical |
Int8 , UInt8 , Int16 , UInt16 , Int32 |
integer |
UInt32 , UInt64 |
integer (if x <= 2_147_483_647 ) or numeric |
Int64 |
integer (if abs(x) <= 2_147_483_647 ) or bit64::integer64 |
Float32 , Float64 |
numeric |
Date |
Date |
Datetime |
POSIXct |
Duration |
difftime(units='secs') |
Time |
hms::hms |
String |
character |
Categorical |
unordered factor |
Enum |
ordered factor |
Object |
depends on the contents‡ |
Null |
vctrs::unspecified |
Binary , Decimal , List , Array |
-- |
* For pandas Series and DataFrames, string indexes (and
categorical indexes where the categories are strings) will be automatically
converted to names
/rownames
. The default index
(pd.RangeIndex(len(python_object))
) will be ignored. All other indexes are
disallowed.
† Because R does not support POSIXct
and Date
matrices or
arrays, dates and datetimes cannot be converted to R matrices or arrays.
‡ For dtype=object
and dtype=pl.Object
, the output R type
depends on the contents, e.g. 'character'
if all elements are strings. Some
additional notes on ryp's handling of object data types:
None
,np.nan
,pd.NA
,pd.NaT
,np.datetime64('NaT')
, andnp.timedelta64('NaT')
are all treated as missing values – even for polars, wherenp.nan
is ordinarily treated as a floating-point number rather than a missing value.- Length-0 and all-missing data will be converted to the
vctrs::unspecified
R type (vctrs
is part of the tidyverse). - If the elements are objects with a mix of types (or datetimes with a mix of time zones), Arrow will generally cause the conversion to fail, though mixes of related types (e.g. int and float) will be automatically cast to the common supertype and succeed.
- Conversion will also fail if the contents are objects that are not
representable as R vector elements. This includes
bytes
/bytearray
(which are only representable in R when scalar, as araw
vector) and Python containers (list
,tuple
, anddict
). - pandas
Timedelta
objects will be rounded down to the nearest microsecond, following the behavior of Arrow.
R | Python |
---|---|
NULL |
None |
NA |
None (if scalar or format='polars' ), None /nan /pd.NA /pd.NaT /np.datetime64('NaT', 'us') /np.timedelta64('NaT', 'ns') /etc. (if format='numpy' 'pandas' or 'pandas-pyarrow' ) |
NaN |
nan |
length-1 vector, matrix or array, squeeze == False |
scalar |
vector or 1D array, format == 'numpy' |
1D NumPy array |
vector or 1D array, format == 'pandas' or format == 'pandas-pyarrow' |
pandas Series |
vector or 1D array, format == 'polars' |
polars Series (if index=False ) or two-column DataFrame |
matrix or data.frame , format == 'numpy' |
2D NumPy array |
matrix or data.frame , format == 'pandas' or format == 'pandas-pyarrow' |
pandas DataFrame |
matrix or data.frame , format == 'polars' |
polars DataFrame |
≥ 3D array | NumPy array |
unnamed list | list |
named list, S3 object, S4 object, environment, S6 object | dict |
dgRMatrix |
csr_array(dtype='int32') |
dgCMatrix |
csc_array(dtype='int32') |
dgTMatrix |
coo_array(dtype='int32') |
lgRMatrix , ngRMatrix |
csr_array(dtype=bool) |
lgCMatrix , ngCMatrix |
csc_array(dtype=bool) |
lgTMatrix , ngTMatrix |
coo_array(dtype=bool) |
formula (~ ) |
-- |
R | Python scalar | NumPy | pandas | pandas-pyarrow | polars |
---|---|---|---|---|---|
logical |
bool |
bool |
bool |
ArrowDtype(pa.bool_()) |
Boolean |
integer |
int |
int32 |
int32 |
ArrowDtype(pa.int32()) |
Int32 |
bit64::integer64 |
int |
int64 |
int64 |
ArrowDtype(pa.int64()) |
Int64 |
numeric |
float |
float |
float |
ArrowDtype(pa.float64()) |
Float64 |
character |
str |
object (with str elements) |
object (with str elements) |
ArrowDtype(pa.string()) |
String |
complex |
complex |
complex128 |
complex128 |
complex128 |
-- |
raw |
bytearray |
-- | -- | -- | -- |
unordered factor |
str |
object (with str elements) |
CategoricalDtype(ordered=False) |
ArrowDtype(pa.dictionary(pa.int8(), pa.string(), ordered=0)) |
Categorical |
ordered factor |
str |
object (with str elements) |
CategoricalDtype(ordered=True) |
ArrowDtype(pa.dictionary(pa.int8(), pa.string(), ordered=1)) |
Enum |
POSIXct without time zone |
datetime.datetime * |
datetime64[us] * |
datetime64[us] * |
ArrowDtype(pa.timestamp('us')) * |
Datetime('us') * |
POSIXct with time zone |
datetime.datetime * |
datetime64[us] * (time zone discarded) |
DatetimeTZDtype('us', time_zone) * |
ArrowDtype(pa.timestamp('us', time_zone)) * |
Datetime('us, time_zone) * |
POSIXlt |
dict of scalars |
dict of NumPy arrays |
dict of pandas Series |
dict of pandas Series |
dict of polars Series |
Date |
datetime.date |
datetime64[D] |
datetime64[ms] |
ArrowDtype(pa.date32('day')) |
Date |
difftime |
datetime.timedelta * |
timedelta64[ns] |
timedelta64[ns] |
ArrowDtype(pa.duration('ns')) |
Duration(time_unit='ns') |
hms::hms |
datetime.time * |
object (with datetime.time elements)* |
object (with datetime.time elements)* |
ArrowDtype(pa.time64('ns')) * |
Time |
vctrs::unspecified |
None |
object (with None elements) |
object (with None elements) |
ArrowDtype(pa.null()) |
Null |
* Due to the limitations of conversion with Arrow, POSIXct
and
hms::hms
values are rounded down to the nearest microsecond when converting
to Python, except for hms::hms
when converting to polars. difftime
values
are also rounded down to the nearest microsecond, but only when converting to
scalar datetime.timedelta
values (which cannot represent nanoseconds).
- Apply R's
scale()
function to a pandas DataFrame:
import pandas as pd
from ryp import r, to_py, to_r
data = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 3, 4]})
to_r(data, 'data')
r('data')
# a b
# 1 1 1
# 2 2 3
# 3 3 4
r('data <- scale(data)') # scale the R data.frame
scaled_data = to_py('data', format='pandas') # convert the R data.frame to Python
scaled_data
# a b
# 0 -1.0 -1.091089
# 1 0.0 0.218218
# 2 1.0 0.872872
Note: we could have just written to_py('scale(data)')
instead of
r('data <- scale(data)')
followed by to_py('data')
. We could also have
run set_config(to_py_format='pandas')
at the top, to avoid having to specify
format='pandas'
in each to_py()
call.
- Run a linear model on a polars DataFrame:
import polars as pl
from ryp import r, to_py, to_r
data = pl.DataFrame({'y': [7, 1, 2, 3, 6], 'x': [5, 2, 3, 2, 5]})
to_r(data, 'data')
r('model <- lm(y ~ x, data=data)')
coef = to_py('summary(model)$coefficients', index='variable')
p_value = coef.filter(variable='x').select('Pr(>|t|)')[0, 0]
p_value
# 0.02831035772841884
- Recursive conversion, showcasing all the keyword arguments of
to_r()
andto_py()
:
import numpy as np
from ryp import r, to_py, to_r
arrays = {'ints': np.array([[1, 2], [3, 4]]),
'floats': np.array([[0.5, 1.5], [2.5, 3.5]])}
to_r(arrays, 'arrays', format='data.frame',
rownames = ['row1', 'row2'], colnames = ['col1', 'col2'])
r('arrays')
# $ints
# col1 col2
# row1 1 2
# row2 3 4
#
# $floats
# col1 col2
# row1 0.5 1.5
# row2 2.5 3.5
arrays = to_py('arrays', format='pandas', index='foo')
arrays['ints']
# col1 col2
# foo
# row1 1 2
# row2 3 4
arrays['floats']
# col1 col2
# foo
# row1 0.5 1.5
# row2 2.5 3.5