### R in Jupyter Warm-Up

We're going to be using R packages in this course, but from within Jupyter notebooks.  

[rpy2](https://rpy2.github.io/doc/latest/html/index.html) is a Python package that provides interfaces to R packages. It makes it possible to run R embedded in Python.

As a warm-up exercise we're going to get set up to import R packages.  Then we're going to import a simple data set, summarize the data in it, and fit a basic linear regression model.

### Some R Documentation

R downloads, source, packages, resources, etc. can be found at:  

[R Project](https://www.r-project.org) (be sure to check out the Task Views on a CRAN mirror)

Accessing R documentation from within Python and/or Jupyter can be a little cumbersome.  Here are a few places you can find it:  

[R Project Documentation](https://www.r-project.org/other-docs.html)  

[R Documentation](https://www.rdocumentation.org/)

[R Manuals in html and pdf](https://stat.ethz.ch/R-manual/)  

[R Package Documentation](https://rdrr.io/)  

First, some preliminary installs and checks.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np

In [2]:
# This bit widens all cells in this Notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

In [3]:
# This can just be commented out.  It's a kluge for one of my environ
#import os
#os.environ['R_HOME']='/home/lynd/anaconda36/lib/R'

In [4]:
import rpy2
rpy2.__version__
from rpy2.rinterface import R_VERSION_BUILD
print(R_VERSION_BUILD)

'2.8.5'

('3', '4.1', '', 72865)


Importing the top level interface sub-package:

In [5]:
import rpy2.robjects as robjects

Getting the R base packages and utilities:

In [6]:
from rpy2.robjects.packages import importr
base=importr('base')
utils=importr('utils')
stats=importr('stats')

Names in R can contain a dot, '.'  In Python, however, the dot is the attribute-access operator. One of the things that `importr` does is to convert R dots into underscores, so R's `available.packages()` becomes `utils.available_packages()`.
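As a rough illustration of that renaming, here is a minimal pure-Python sketch. It is not rpy2's actual implementation, and the helper `translate_r_names` is a hypothetical name introduced only for this example. It also shows why a package that defined both `a.b` and `a_b` would be ambiguous after translation.

```python
from collections import defaultdict

def translate_r_names(r_names):
    """Map R-style names to Python-friendly ones by replacing '.' with '_'.

    Hypothetical helper for illustration only; not part of rpy2.
    Returns (mapping, collisions): mapping goes from the Python name to the
    single R name it stands for, and collisions lists the Python names that
    more than one R name would translate to.
    """
    by_py_name = defaultdict(list)
    for r_name in r_names:
        by_py_name[r_name.replace('.', '_')].append(r_name)
    mapping = {py: rs[0] for py, rs in by_py_name.items() if len(rs) == 1}
    collisions = {py: rs for py, rs in by_py_name.items() if len(rs) > 1}
    return mapping, collisions

mapping, collisions = translate_r_names(
    ['available.packages', 'install.packages', 'head', 'a.b', 'a_b'])
print(mapping)
print(collisions)
```

When a real package contains such a clash, `importr` accepts a `robject_translations` dictionary so the Python-side names can be chosen explicitly.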

Get a list (a matrix, actually) of available packages:

In [12]:
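import rpy2.interactive as r
import rpy2.interactive.packages   # this import can take a few seconds
r.packages.importr('utils')        # registers utils so it is reachable as rAvailPacks.utils
rAvailPacks=r.packages.packages
patMat=rAvailPacks.utils.available_packages()  # queries the package repository, so it needs network access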

In [8]:
type(patMat)
tuple(patMat.dim)

rpy2.robjects.vectors.Matrix

(12695, 17)

R's help pages can also be fetched from within Python:

In [None]:
import rpy2.robjects.help as rh
base_help=rh.Package('base')  # Using the Package class
rSum=base_help.fetch('sum')
rSum.sections.keys()
print(rSum.to_docstring(('description',)))

### Loading the Data

Let's read the data into a `pandas` DataFrame, assuming that the file is in a `data` subdirectory of the current working directory:

In [10]:
patSatDF=pd.read_csv('data/DECART-patSat.csv')

In [11]:
patSatDF.dtypes

caseID    int64
patSat    int64
q2        int64
q3        int64
q4        int64
q5        int64
q6        int64
q7        int64
q8        int64
q9        int64
ptCat     int64
dtype: object

### Converting a pandas DataFrame to an R dataframe

We're going to convert this pandas DataFrame to an R data frame:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()  # this is necessary

In [None]:
rpatSatDF=pandas2ri.py2ri(patSatDF)  # from pandas DataFrame to R data frame (rpy2 2.x API; renamed py2rpy in rpy2 3.x)

In [None]:
type(rpatSatDF)

In [None]:
print(rpatSatDF.head())

### Creating an unordered R Factor

The `ptCat` variable is an unordered categorical variable, or an unordered factor in R.  So let's create a factor version of it in this dataframe:

In [None]:
asFactor=robjects.r['as.factor']   # This gets the R function as.factor()

In [None]:
type(asFactor)

In [None]:
ptType=asFactor(rpatSatDF.rx2('ptCat')) # picking out the ptCat column

# .rx() and .rx2() are "delegators" that permit R-style selection:
# .rx() mirrors R's [ operator and .rx2() mirrors R's [[ operator
# When .rx() or .rx2() are used with numbers, indexing starts at 1, and not 0 like in Python

In [None]:
type(ptType)

In [None]:
rlevels=base.levels # This is R's levels() function
                    # Or, rlevels=robjects.r['levels']

In [None]:
rlevels(ptType)

In [None]:
rTable=base.table   #R's table function
                    # Alternatively, rTable=robjects.r['table']  

In [None]:
rTable(ptType)

Next, add this new factor to the dataframe rpatSatDF as a column:

In [None]:
rpatSatDF=robjects.r.cbind(rpatSatDF,ptType=ptType)

In [None]:
rpatSatDF.names

In [None]:
print(rpatSatDF.head())

### Now, a Simple Linear Regression Model

In [None]:
formula='patSat~q2+q6+ptType'
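
Because the formula here is an ordinary Python string, it can be assembled programmatically when the predictor list gets longer. A minimal sketch, assuming a hypothetical helper `make_formula` (not part of rpy2 or R) that joins column names with `+` and drops accidental duplicates:

```python
def make_formula(response, predictors):
    """Build an R-style formula string such as 'y~x1+x2'.

    Hypothetical helper for illustration; duplicates are dropped while
    keeping the first occurrence of each predictor.
    """
    unique = list(dict.fromkeys(predictors))  # order-preserving de-duplication
    if not unique:
        return f'{response}~1'  # intercept-only model in R's formula syntax
    return f'{response}~' + '+'.join(unique)

formula_sketch = make_formula('patSat', ['q2', 'q6', 'ptType', 'q6'])
print(formula_sketch)  # patSat~q2+q6+ptType
```

The resulting string can be passed to `stats.lm` in the same way as a hand-written one.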

In [None]:
test_fit=stats.lm(formula,data=rpatSatDF)  

# note that the lm() function is in the stats namespace

In [None]:
test_summary=base.summary(test_fit)

In [None]:
print(test_summary.rx2('coefficients'))

In [None]:
print(test_summary.names)

In [None]:
print(test_summary.rx('coefficients'))   # Here's an example of using the .rx() delegator
print(test_summary.rx('r.squared'))

### Regression Diagnostic (Dx) Plots

In [None]:
rPlot=robjects.r['plot']  # get R's plot() function
rGraphsOff=robjects.r["graphics.off"]

In [None]:
rPlot(test_fit) # default regression Dx plots
rGraphsOff()


### Saving Results

Your results can be saved in a Python way, e.g. by pickling or by using the `shelve` package.  
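
A minimal sketch of the Python route, using only the standard library. The dictionary below holds placeholder values, not output from the model above, and the file names are arbitrary. Whether a given rpy2 object can be pickled directly is not assumed here, so only plain Python values are stored.

```python
import os
import pickle
import shelve
import tempfile

# Placeholder results for illustration only (not real model output).
results = {'formula': 'patSat~q2+q6+ptType',
           'predictors': ['q2', 'q6', 'ptType']}

out_dir = tempfile.mkdtemp()  # scratch directory so nothing is overwritten

# Option 1: pickle a single object to a file and read it back.
pickle_path = os.path.join(out_dir, 'results.pkl')
with open(pickle_path, 'wb') as fh:
    pickle.dump(results, fh)
with open(pickle_path, 'rb') as fh:
    restored = pickle.load(fh)

# Option 2: shelve acts like a persistent dictionary keyed by strings.
shelf_path = os.path.join(out_dir, 'results_shelf')
with shelve.open(shelf_path) as shelf:
    shelf['warmup'] = results
with shelve.open(shelf_path) as shelf:
    from_shelf = shelf['warmup']

print(restored == results, from_shelf == results)
```

Pickling suits a one-off snapshot of a single object, while a shelf is handier when several named results accumulate over a session.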

You can also save things in an "R Way," for example with R's `save()` or `saveRDS()` functions.


**EXERCISE** Add the predictor variables q3, q4, q5, q7, q8, and q9 to the above regression model.  

Question:  How would you assess the extent of multicollinearity amongst the predictors?  

Question:  How might you determine if any of your coefficient estimates are biased due to endogeneity?