<a href="https://colab.research.google.com/github/coatless/colab-notes/blob/main/01-pandas-to-r-data-types.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (Aside) Fix rpy2 version

Downgrade rpy2 to work around the error [Conversion 'py2rpy' not defined for objects of type '<class 'str'>'](https://stackoverflow.com/questions/74283327/conversion-py2rpy-not-defined-for-objects-of-type-class-str).

In [2]:
# Downgrade rpy2 from 3.5.5
!pip install rpy2==3.5.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp38-cp38-linux_x86_64.whl size=310205 sha256=a2902d09ce7d5eeec45f0ee1e950dc67cfeff9a51b13cd1bf2e2e1ad5d15de2b
  Stored in directory: /root/.cache/pip/wheels/6b/40/7d/f63e87fd83e8b99ee837c8e3489081c4b3489134bc520235ed
Successfully built rpy2
Installing collected packages: rpy2
  Attempting uninstall: rpy2
    Found existing installation: rpy2 3.5.5
    Uninstalling rpy2-3.5.5:
      Successfully uninstalled rpy2-3.5.5
Successfully installed rpy2-3.5.1
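
After the install finishes, it can be worth confirming which rpy2 version is actually installed in the environment. The following is a minimal sketch using only the standard library; the helper `installed_version` and the constant `PINNED_RPY2` are illustrative names, not part of rpy2. If rpy2 was already imported in the session, the runtime may need a restart before the downgraded version is used.

```python
from importlib import metadata

PINNED_RPY2 = "3.5.1"  # the version requested in the pip cell above


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("rpy2")
if found is None:
    print("rpy2 is not installed")
elif found != PINNED_RPY2:
    print(f"rpy2 {found} is installed, expected {PINNED_RPY2}")
else:
    print(f"rpy2 {found} matches the pinned version")
```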


# Conversion of Pandas to R Data Types

Create a pandas data frame whose columns are initialized in different ways (scalars, typed Series, timestamps, categoricals), then inspect the resulting dtypes.

In [3]:
import pandas as pd 
import numpy as np 

dft = pd.DataFrame(
    {
        # Scalars
        "integer": 1,
        "numeric": 3.14,
        "logical": False,
        "character": "foo",
        "complex": complex(1, 2),
        # Series
        "numeric-list": pd.Series([1.0] * 3).astype("float32"),
        "integer-list": pd.Series([1] * 3, dtype="int8"),
        "complex-list": pd.Series(np.array([1, 2, 3]) + np.array([4, 5, 6]) *1j).astype("complex128"),
        "character-list": pd.Series(["hello", "world", "stat"]),
        "logical-list": pd.Series([True, False, True]),
        "character-string-list": pd.Series(["a", "b", "c"], dtype="string"),
        # Time Dependency: https://pandas.pydata.org/docs/user_guide/timeseries.html
        "POSIXct-POSIXt-timestamp": pd.Timestamp("20230102"),
        "POSIXct-POSIXt-date_range": pd.date_range("2023", freq="D", periods=3),
        #"POSIXct-POSIXt-period": pd.period_range("1/1/2011", freq="M", periods=3), # Not supported in rpy2
        #"POSIXct-POSIXt-timedelta": pd.to_timedelta(np.arange(3), unit="s"), # Not supported in rpy2
        # Categorical: https://pandas.pydata.org/docs/user_guide/categorical.html
        "factor": pd.Categorical(["a", "b", "c"], ordered=False),
        "ordered-factor": pd.Categorical(["a", "b", "c"], categories=["a", "b", "c"], ordered=True),
    }
)

dft

Unnamed: 0,integer,numeric,logical,character,complex,numeric-list,integer-list,complex-list,character-list,logical-list,character-string-list,POSIXct-POSIXt-timestamp,POSIXct-POSIXt-date_range,factor,ordered-factor
0,1,3.14,False,foo,1.0+2.0j,1.0,1,1.0+4.0j,hello,True,a,2023-01-02,2023-01-01,a,a
1,1,3.14,False,foo,1.0+2.0j,1.0,1,2.0+5.0j,world,False,b,2023-01-02,2023-01-02,b,b
2,1,3.14,False,foo,1.0+2.0j,1.0,1,3.0+6.0j,stat,True,c,2023-01-02,2023-01-03,c,c


In [4]:
dft.dtypes

integer                               int64
numeric                             float64
logical                                bool
character                            object
complex                          complex128
numeric-list                        float32
integer-list                           int8
complex-list                     complex128
character-list                       object
logical-list                           bool
character-string-list                string
POSIXct-POSIXt-timestamp     datetime64[ns]
POSIXct-POSIXt-date_range    datetime64[ns]
factor                             category
ordered-factor                     category
dtype: object
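
The dtypes above show two patterns: a scalar is broadcast to every row and takes the default dtype pandas picks for that Python type, while a Series keeps the dtype it was created with. A small sketch isolating that behaviour (the frame `demo` and its column names are illustrative only):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        # Python scalars are broadcast to the length of the other columns
        "scalar-int": 1,
        "scalar-float": 3.14,
        # Series keep the dtype requested at construction
        "series-int8": pd.Series([1] * 3, dtype="int8"),
        "series-float32": pd.Series([1.0] * 3).astype("float32"),
    }
)

# One dtype per column; compare the scalar and Series variants
print(demo.dtypes)
```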

Check by passing the data frame into _R_ using `rpy2`:

In [5]:
%load_ext rpy2.ipython

In [6]:
%%R -i dft

sapply(dft, class)

$integer
[1] "integer"

$numeric
[1] "numeric"

$logical
[1] "logical"

$character
[1] "character"

$complex
[1] "complex"

$`numeric-list`
[1] "numeric"

$`integer-list`
[1] "integer"

$`complex-list`
[1] "complex"

$`character-list`
[1] "character"

$`logical-list`
[1] "logical"

$`character-string-list`
[1] "character"

$`POSIXct-POSIXt-timestamp`
[1] "POSIXct" "POSIXt" 

$`POSIXct-POSIXt-date_range`
[1] "POSIXct" "POSIXt" 

$factor
[1] "factor"

$`ordered-factor`
[1] "ordered" "factor" 



# Conversion Lookup Table

| R                                                                                       | Pandas     |
| --------------------------------------------------------------------------------------- | --------- |
| integer                                                                                 | [`Int{8,16,32,64}`](https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na), [`UInt{8,16,32,64}`](https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na), [`np.int{8,16,32,64}`](https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases), [`np.uint{8,16,32,64}`](https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases) and [`np.integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#integer-types)   |
| numeric                                                                                   | [`np.float{16, 32, 64, 96, 128}`](https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases) and [`np.floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#floating-point-types)    |
| complex                                                                                   | [`np.complex{64, 128, 192, 256}`](https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases) and [`np.complexfloating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#complex-floating-point-types)    |
| character                                                                               | `object`    |
| character                                                                               | `string`    |
| logical                                                                                  | `bool`    |
| [POSIXct](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html), [POSIXt](https://stat.ethz.ch/R-manual/R-devel/library/base/html/DateTimeClasses.html) | `datetime64[ns]` |
| Not supported by rpy2                                                                   | `period[*]` |
| Not supported by rpy2                                                                   | `timedelta64[ns]`      |
| factor                                                                                  | `category` (unordered) |
| ordered, factor                                                                         | `category` (ordered) |

For a discussion of NumPy data types, see: https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases

For additional pandas data types, see: https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
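
Not every sized alias exists on every platform: the extended-precision names such as `np.float96` and `np.float128` are only defined where the platform's extended-precision float has that size. A quick sketch to list which aliases the local NumPy build provides; the candidate names follow NumPy's sized-alias documentation, and `available_aliases` is an illustrative helper.

```python
import numpy as np

FLOAT_ALIASES = ["float16", "float32", "float64", "float96", "float128"]
COMPLEX_ALIASES = ["complex64", "complex128", "complex192", "complex256"]


def available_aliases(names):
    """Return the subset of `names` defined as attributes of this NumPy build."""
    return [name for name in names if hasattr(np, name)]


print("float:  ", available_aliases(FLOAT_ALIASES))
print("complex:", available_aliases(COMPLEX_ALIASES))
```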

# Conversion Mapping Function

Design a function that maps each pandas dtype to the R class it becomes after conversion.

In [21]:
import pandas as pd

def convert_pandas_dtype_to_r(s):
  """Return the R class name expected for a pandas Series after conversion."""
  # Reading .dtype fails on plain Python scalars, so only a whole Series gets through
  dtype = s.dtype

  # Check categoricals first so they cannot be mistaken for strings
  if isinstance(dtype, pd.CategoricalDtype):
    return "ordered factor" if dtype.ordered else "factor"
  elif pd.api.types.is_bool_dtype(dtype):
    return "logical"
  elif pd.api.types.is_integer_dtype(dtype):
    return "integer"
  elif pd.api.types.is_float_dtype(dtype):
    return "numeric"
  elif pd.api.types.is_complex_dtype(dtype):
    return "complex"
  elif pd.api.types.is_datetime64_any_dtype(dtype):
    return "POSIXct"
  elif pd.api.types.is_timedelta64_dtype(dtype) or isinstance(dtype, pd.PeriodDtype):
    # Neither type could be passed through rpy2 in the data frame above
    return "Not supported"
  elif pd.api.types.is_object_dtype(dtype) or pd.api.types.is_string_dtype(dtype):
    return "character"

  return "Unknown"
  
dft.agg([convert_pandas_dtype_to_r])

Unnamed: 0,integer,numeric,logical,character,complex,numeric-list,integer-list,complex-list,character-list,logical-list,character-string-list,POSIXct-POSIXt-timestamp,POSIXct-POSIXt-date_range,factor,ordered-factor
convert_pandas_dtype_to_r,integer,numeric,logical,character,complex,numeric,integer,complex,character,logical,character,POSIXct,POSIXct,factor,ordered factor
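
The labels returned by the function can be checked against the classes R reported in the `sapply(dft, class)` output. The sketch below copies both results by hand from the outputs shown in this notebook. The `matches` helper and its rule (accept either the full class vector or its first entry) are an illustrative choice, needed because the function returns `POSIXct` while R reports both `POSIXct` and `POSIXt`.

```python
# Classes reported by R in the sapply() output earlier in this notebook
observed = {
    "integer": ["integer"],
    "numeric": ["numeric"],
    "logical": ["logical"],
    "character": ["character"],
    "complex": ["complex"],
    "numeric-list": ["numeric"],
    "integer-list": ["integer"],
    "complex-list": ["complex"],
    "character-list": ["character"],
    "logical-list": ["logical"],
    "character-string-list": ["character"],
    "POSIXct-POSIXt-timestamp": ["POSIXct", "POSIXt"],
    "POSIXct-POSIXt-date_range": ["POSIXct", "POSIXt"],
    "factor": ["factor"],
    "ordered-factor": ["ordered", "factor"],
}

# Labels returned by convert_pandas_dtype_to_r for the same columns
predicted = {
    "integer": "integer",
    "numeric": "numeric",
    "logical": "logical",
    "character": "character",
    "complex": "complex",
    "numeric-list": "numeric",
    "integer-list": "integer",
    "complex-list": "complex",
    "character-list": "character",
    "logical-list": "logical",
    "character-string-list": "character",
    "POSIXct-POSIXt-timestamp": "POSIXct",
    "POSIXct-POSIXt-date_range": "POSIXct",
    "factor": "factor",
    "ordered-factor": "ordered factor",
}


def matches(label, r_classes):
    """True if the label equals the full R class vector or its first entry."""
    return label == " ".join(r_classes) or label == r_classes[0]


# Collect every column where the prediction and the R result disagree
mismatches = {
    column: (predicted[column], observed[column])
    for column in observed
    if not matches(predicted[column], observed[column])
}
print(mismatches or "all columns agree")
```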


# Reticulate

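Note: run this section from a standalone R session, not from a Jupyter notebook. Calling reticulate from R that rpy2 has embedded in Python (Python -> rpy2 -> R -> reticulate) crashes, as the warning in the output below explains.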

In [None]:
%%R 

# Install reticulate
#install.packages("reticulate")

# Create a conda environment
reticulate::conda_create("r-reticulate")

# Install the pandas package
reticulate::conda_install("r-reticulate", "pandas")

# Load the environment
reticulate::use_condaenv("r-reticulate")

# Run
reticulate::source_python('dtype-test.py')


    consider that it could be called from a Python process. This
    results in a quasi-obligatory segfault when rpy2 is evaluating
    R code using it. On the hand, rpy2 is accounting for the
    fact that it might already be running embedded in a Python
    process. This is why:
    - Python -> rpy2 -> R -> reticulate: crashes
    - R -> reticulate -> Python -> rpy2: works

    The issue with reticulate is tracked here:
    https://github.com/rstudio/reticulate/issues/208
    





Error: Unable to find conda binary. Is Anaconda installed?


RInterpreterError: ignored
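
Given the warning above (Python -> rpy2 -> R -> reticulate crashes, while R -> reticulate -> Python works), one way to keep driving the workflow from Python is to launch R as a separate process, so that reticulate never runs inside rpy2's embedded R. This is a minimal sketch, not a tested fix for this notebook: it assumes `Rscript` is on the PATH and that the reticulate commands have been saved to a script. The file name `run-reticulate.R` and both helper names are hypothetical.

```python
import shutil
import subprocess


def build_rscript_command(script_path):
    """Return the argument list that runs an R script in a fresh R process."""
    return ["Rscript", "--vanilla", script_path]


def run_r_script(script_path):
    """Run the script with Rscript and return its captured standard output."""
    if shutil.which("Rscript") is None:
        raise FileNotFoundError("Rscript was not found on the PATH")
    completed = subprocess.run(
        build_rscript_command(script_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if R exits with an error
    )
    return completed.stdout


# Hypothetical usage, assuming the script exists:
# print(run_r_script("run-reticulate.R"))
```

The conda error shown above is a separate problem: reticulate still needs a conda installation it can find, whichever way R is started.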