
## Overview

This notebook shows how to use Arangopipe from an R project. It walks through a simple example that uses the arangopipe package to store meta-data about model development done in R. To run the notebook, first install the R kernel for Jupyter:
```conda install -c r r-irkernel```

The cells below walk step by step through developing a regression model for the California housing dataset in R and then using Arangopipe to store meta-data about the results.

In [1]:
# Install Required packages for reading the data file
install.packages("readr",repos = "http://cran.rstudio.com/")
install.packages("RCurl", repos = "http://cran.rstudio.com/")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



### Load the libraries and read the data file

In [2]:
library(readr)
library(RCurl)
fp <- "https://raw.githubusercontent.com/arangoml/arangopipe/master/arangopipe/tests/CItests/cal_housing.csv"
df <- read.csv(fp)

### List the data types

In [3]:
str(df)

'data.frame':	20639 obs. of  9 variables:
 $ lat             : num  -122 -122 -122 -122 -122 ...
 $ long            : num  37.9 37.9 37.9 37.9 37.9 ...
 $ housingMedAge   : int  21 52 52 52 52 52 52 42 52 52 ...
 $ totalRooms      : int  7099 1467 1274 1627 919 2535 3104 2555 3549 2202 ...
 $ totalBedrooms   : int  1106 190 235 280 213 489 687 665 707 434 ...
 $ population      : int  2401 496 558 565 413 1094 1157 1206 1551 910 ...
 $ households      : int  1138 177 219 259 193 514 647 595 714 402 ...
 $ medianIncome    : num  8.3 7.26 5.64 3.85 4.04 ...
 $ medianHouseValue: num  358500 352100 341300 342200 269700 ...


### Transform the response variable (don't run the next cell twice!)

In [4]:
# don't run this cell twice, otherwise you will be applying the log transform multiple times.
df$medianHouseValue = log(df$medianHouseValue)
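
One way to make the transform safe to re-run is to record that it has already been applied. The sketch below is only an illustration: `log_once` and the `log_applied` attribute are hypothetical names that do not appear in the notebook, and the data frame is a toy stand-in for `df`.

```r
# Hypothetical helper: apply log() to a column at most once, tracking
# transformed columns in an attribute on the data frame.
log_once <- function(data, column) {
  done <- attr(data, "log_applied")
  if (column %in% done) {
    return(data)  # already transformed; leave the data unchanged
  }
  data[[column]] <- log(data[[column]])
  attr(data, "log_applied") <- c(done, column)
  data
}

toy <- data.frame(medianHouseValue = c(100000, 250000, 500000))
once  <- log_once(toy, "medianHouseValue")
twice <- log_once(once, "medianHouseValue")

# Re-running is a no-op: both results hold log(original values).
identical(once$medianHouseValue, twice$medianHouseValue)
```

With a guard like this, running the cell a second time leaves the response variable unchanged.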

### Generate the test and train datasets

In [5]:
smp_size <- floor(0.667 * nrow(df))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(df)), size = smp_size)

df.train <- df[train_ind, ]
df.test <- df[-train_ind, ]
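
A quick sanity check on a split like this is that the two partitions are disjoint and together cover every row. The sketch below repeats the same steps on a small toy data frame (an assumption made only to keep the example self-contained) and asserts both properties.

```r
set.seed(123)
toy <- data.frame(x = 1:10, y = (1:10)^2)

# Same recipe as above: sample roughly two thirds of the row indices.
n_train   <- floor(0.667 * nrow(toy))
train_ind <- sample(seq_len(nrow(toy)), size = n_train)
toy.train <- toy[train_ind, ]
toy.test  <- toy[-train_ind, ]

# The partitions should not share rows and should add up to the full data.
stopifnot(
  nrow(toy.train) + nrow(toy.test) == nrow(toy),
  length(intersect(rownames(toy.train), rownames(toy.test))) == 0
)
c(train = nrow(toy.train), test = nrow(toy.test))
```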

### Inspect the training dataset

In [6]:
head(df.train)

Unnamed: 0_level_0,lat,long,housingMedAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
18847,-122.38,41.43,45,2245,448,1155,421,1.6509,10.74074
18895,-122.24,38.12,39,2967,500,1243,523,4.2902,11.93426
2986,-119.0,35.33,35,991,221,620,207,1.9417,10.89303
1842,-122.29,37.91,40,2085,329,796,339,5.5357,12.51979
3371,-118.28,34.26,32,1079,207,486,167,4.9833,12.26905
11638,-118.06,33.83,22,5290,1054,2812,1021,4.53,12.33006


### Develop the linear model

In [7]:
lm.housing <- lm(medianHouseValue ~ ., data = df.train)

### Generate the test and training predictions

In [8]:
trng.pred <- predict(lm.housing, df.train)
test.pred <- predict(lm.housing, df.test)
# RMSE: square each residual first, then average and take the square root
rmse.trng <- sqrt(sum((df.train$medianHouseValue - trng.pred)^2) / nrow(df.train))
rmse.test <- sqrt(sum((df.test$medianHouseValue - test.pred)^2) / nrow(df.test))
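
The root mean squared error squares each residual before averaging. Because the response was log-transformed, the values here are in log units. A minimal sketch of the calculation follows; `rmse` is a hypothetical helper, and the built-in `mtcars` dataset stands in for the housing data.

```r
# Hypothetical helper (not part of the notebook): root mean squared error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)
rmse(actual, predicted)  # sqrt((0 + 0 + 4) / 3)

# For a least-squares fit with an intercept the residuals sum to roughly
# zero, so squaring the *sum* of residuals would report an error near zero.
fit <- lm(mpg ~ wt, data = mtcars)
sum(residuals(fit))                     # approximately 0
rmse(mtcars$mpg, predict(fit, mtcars))  # clearly positive
```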

### Summarize the model developed

In [9]:
summary(lm.housing)


Call:
lm(formula = medianHouseValue ~ ., data = df.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4299 -0.2056  0.0019  0.1950  3.2533 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.217e+01  3.758e-01 -32.378  < 2e-16 ***
lat           -2.798e-01  4.288e-03 -65.265  < 2e-16 ***
long          -2.817e-01  4.050e-03 -69.551  < 2e-16 ***
housingMedAge  3.071e-03  2.589e-04  11.865  < 2e-16 ***
totalRooms    -3.356e-05  4.708e-06  -7.128 1.07e-12 ***
totalBedrooms  4.604e-04  4.091e-05  11.253  < 2e-16 ***
population    -1.624e-04  6.240e-06 -26.031  < 2e-16 ***
households     2.473e-04  4.416e-05   5.602 2.16e-08 ***
medianIncome   1.780e-01  2.004e-03  88.799  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3399 on 13757 degrees of freedom
Multiple R-squared:  0.6421,	Adjusted R-squared:  0.6419 
F-statistic:  3086 on 8 and 13757 DF,  p-value: < 2.2e-16


### Set up to save the model meta-data to Arangopipe by installing the reticulate library

In [10]:
install.packages("reticulate")


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [11]:
library("reticulate")
miniconda_update(path = miniconda_path())


1. Load the library
2. Set up a Python environment for this project (Miniconda)
3. Install Arangopipe and its dependencies in the environment

In [12]:
conda_create("r-reticulate")
py_install(env = "r-reticulate", packages = c("arangopipe==0.0.6.9.5",
                                              "python-arango","pandas",
                                              "PyYAML==5.1.1", "sklearn2",
                                              "yapf", "autopep8"),pip = TRUE)

In [13]:
# Clone the branch that contains the connection helper script (examples/arangopipe_conn.py)
system("git clone -b r_example_arangopipe https://github.com/arangoml/arangopipe.git")

### Use a Python connector to set up an Arangopipe connection

In [14]:
conn_params <- list()
conn_params$DB_service_host <- "arangoml.arangodb.cloud"
conn_params$DB_end_point <- "createDB"
conn_params$DB_service_name <- "createDB"
conn_params$DB_service_port <- "8529"
conn_params$conn_protocol <- "https"
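
Before handing the list to Python, it can help to confirm that every expected field is present. The sketch below is an illustration only: `check_conn_params` is a hypothetical helper, and treating all five fields as required is an assumption based on the fields set in this cell, not on the Arangopipe documentation.

```r
# Assumption: these are the fields the connection helper expects.
required_fields <- c("DB_service_host", "DB_end_point", "DB_service_name",
                     "DB_service_port", "conn_protocol")

# Hypothetical validator: stop with a readable message if a field is absent.
check_conn_params <- function(params) {
  absent <- setdiff(required_fields, names(params))
  if (length(absent) > 0) {
    stop("Missing connection parameters: ", paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

demo_params <- list(DB_service_host = "arangoml.arangodb.cloud",
                    DB_end_point    = "createDB",
                    DB_service_name = "createDB",
                    DB_service_port = "8529",
                    conn_protocol   = "https")
check_conn_params(demo_params)
```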

In [15]:
conn_params

In [16]:
source_python('arangopipe/examples/arangopipe_conn.py', convert = TRUE)

In [17]:
apcon <- conn_arangopipe(conn_params)

In [18]:
ap <- apcon$ap
ap_admin <- apcon$ap_admin

In [19]:
proj_info <- list()
proj_info$name <- "R_Arangopipe_Connection_Test"
proj_reg <- ap_admin$register_project(proj_info)

In [20]:
proj_reg

In [21]:
# source_python('arangopipe_conn.py', convert = TRUE)

# ap <- conn_arangopipe()
# ap$lookup_entity("Context_Manager_Test", "project")

### Register the dataset

In [22]:
ds_info <- list("name" = paste("california-housing-dataset", Sys.time(), sep = "-"),
                "description" = "This dataset lists median house prices in California. Various house features are provided.",
                "source" = "UCI ML Repository")

In [23]:
ds_reg <- ap$register_dataset(ds_info)

In [24]:
ds_reg

### Generate the featureset meta-data 

In [25]:
f.info <- sapply(df, class)

In [26]:
f.info["name"] <- paste("logTransformedFeatureset", Sys.time(),sep="-")


In [27]:
f.info <- as.list(f.info)
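
To see what the featureset meta-data looks like, the same steps can be run on a toy data frame. The sketch below uses `vapply` with the first class of each column, a slightly stricter variant of `sapply(df, class)` that always returns one label per column. The toy columns and the name `toyFeatureset` are made up for illustration.

```r
toy <- data.frame(lat = c(-122.2, -119.0), housingMedAge = c(21L, 52L))

# One class label per column, as a named character vector.
f.toy <- vapply(toy, function(col) class(col)[1], character(1))

# Add the featureset name, then convert to a named list.
f.toy["name"] <- "toyFeatureset"
f.toy <- as.list(f.toy)
str(f.toy)
```

Note that a data column literally called `name` would be overwritten by the featureset name in this scheme.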

### Register the featureset

In [28]:
fs_reg <- ap$register_featureset(f.info, ds_reg$`_key`)

### Generate the model meta-data

In [29]:
model_info <- list()
model_info["name"] <- paste("R_Linear_Regression_Model_Housing_Data", Sys.time(),sep="-")


### Register the model meta-data

In [30]:
model_reg <- ap$register_model(model_info, project = "R_Arangopipe_Connection_Test")

### Set up the data structures to capture modeling meta-data summary

In [31]:
run_info = list()

In [32]:
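# Use the number of whole minutes since a fixed reference date as a simple run identifier.
# Note: two runs started within the same minute would get the same identifier.
b1 = ISOdate(2020,11,13)
b2 = Sys.time()
uuid <- as.character(as.integer(difftime(b2,b1,units='mins')))
run_info["run_id"] <- uuid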
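
If runs may start less than a minute apart, a finer-grained identifier avoids collisions. The sketch below is one possible approach, not the notebook's method: `make_run_id` is a hypothetical helper that combines a timestamp to the second with a random hexadecimal suffix.

```r
# Hypothetical helper: timestamp (to the second) plus 8 random hex characters.
make_run_id <- function() {
  stamp  <- format(Sys.time(), "%Y%m%d%H%M%S")
  suffix <- paste(sample(c(0:9, letters[1:6]), 8, replace = TRUE), collapse = "")
  paste(stamp, suffix, sep = "-")
}

id1 <- make_run_id()
id2 <- make_run_id()
c(id1, id2)
```

Because the suffix comes from R's random number generator, calling `set.seed` with a fixed value immediately before `make_run_id` makes the suffix repeatable; only the timestamp part then distinguishes runs.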

In [33]:
model.params.data = list()
model.params.data["name"] = "Linear_Model"
model.params.data["Intercept"] = "True"

model.params = list()
model.params$`run_id` = uuid
model.params$`model_params` = model.params.data

In [34]:
ms <- summary(lm.housing)
model.perf.summary <- list()
model.perf.summary["run_id"] = uuid
model.perf.summary["r.squared"] = ms$r.squared
model.perf.summary["adj.r.squared"] = ms$adj.r.squared
model.perf.summary["timestamp"] = Sys.time()

In [35]:
model.perf.summary
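
Single-bracket assignment of a date-time into a list can drop its class attribute, leaving a bare number of seconds. If a human-readable timestamp is preferred in the stored meta-data, one option is to format it as a string first. The sketch below uses a separate toy list, `perf`, and an ISO 8601 style format in UTC; both are choices made for illustration.

```r
perf <- list()

# Store the timestamp as a formatted string so it survives conversion intact.
perf[["timestamp"]] <- format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
perf
```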

In [36]:
run_info["dataset"] = ds_reg$`_key`
run_info["featureset"] = fs_reg$`_key`
run_info["model"] = model_reg$`_key`
run_info$`model-params` = model.params
run_info$`model-perf` =  model.perf.summary
run_info["tag"] = "R_Arangopipe_Connection_Test"
run_info["project"] = "R_Arangopipe_Connection_Test"

### Log the model meta-data

In [37]:
ri <- ap$log_run(run_info)

### We are done! You can set up your R projects to use Arangopipe in a similar manner.