# Using lakeFS with R - NYC Filming Permits

<img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" height=100/>  <img src="https://www.r-project.org/logo/Rlogo.svg" alt="R logo" width=50/>

lakeFS interfaces with R in two ways: 

* the [S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway), which presents a lakeFS repository as an S3 bucket. You can then read and write data in lakeFS using standard S3 tools such as the `aws.s3` library.
* a [rich API](https://docs.lakefs.io/reference/api.html), which can be accessed from R using the `httr` library. Use the API to work with branches and commits.

_**Learn more about lakeFS in the [Quickstart](https://docs.lakefs.io/quickstart/), and about its support for R in the [documentation](https://docs.lakefs.io/integrations/r.html)**_

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section; just run it)**

In [3]:
repo_name = "using-r-with-lakefs"

### Variables

In [4]:
# aws.s3 library uses these environment variables
# Some, such as region, need to be specified in the function call 
# and are not taken from environment variables.
# See https://github.com/cloudyr/aws.s3/blob/master/man/s3HTTP.Rd for
# full list of configuration parameters when calling the s3 functions.
lakefsEndPoint_no_proto <- sub("^https?://", "", lakefsEndPoint)
lakefsEndPoint_proto <- sub("^(https?)://.*", "\\1", lakefsEndPoint)
# aws.s3 expects a logical for use_https; anything other than plain http is treated as https
useHTTPS <- lakefsEndPoint_proto != "http"

Sys.setenv("AWS_ACCESS_KEY_ID" = lakefsAccessKey,
           "AWS_SECRET_ACCESS_KEY" = lakefsSecretKey,
           "AWS_S3_ENDPOINT" = lakefsEndPoint_no_proto)

# Set the API endpoint
lakefs_api_url<- paste0(lakefsEndPoint,"/api/v1")
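
The protocol handling above can also be wrapped in a small helper. This is a minimal sketch in base R only; `split_endpoint()` is a hypothetical name (not part of `aws.s3` or lakeFS), and it assumes the endpoint always starts with `http://` or `https://`.

```r
# Hypothetical helper: split a lakeFS endpoint URL into the pieces needed
# by aws.s3 (host without protocol) plus a logical use_https flag.
split_endpoint <- function(endpoint) {
    # Assumption: the endpoint always carries an explicit http:// or https:// prefix
    stopifnot(grepl("^https?://", endpoint))
    proto <- sub("^(https?)://.*", "\\1", endpoint)
    host  <- sub("^https?://", "", endpoint)
    list(proto = proto, host = host, use_https = (proto == "https"))
}

parts <- split_endpoint("http://lakefs:8000")
str(parts)
```

With a helper like this, `parts$host` would be the value for `AWS_S3_ENDPOINT` and `parts$use_https` the value for `use_https`.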

### Libraries

In [5]:
library(aws.s3)
library(httr)
library(arrow)


Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp


### Set up S3FileSystem for Arrow access to lakeFS

In [6]:
lakefs <- S3FileSystem$create(
    endpoint_override = lakefsEndPoint,
    access_key = lakefsAccessKey, 
    secret_key = lakefsSecretKey, 
    region = "",
    scheme = lakefsEndPoint_proto # "http" or "https", taken from lakefsEndPoint
)

#### Verify lakeFS credentials by getting lakeFS version

In [7]:
r=GET(url=paste0(lakefs_api_url,"/config/version"), authenticate(lakefsAccessKey, lakefsSecretKey))

In [8]:
print("Verifying lakeFS credentials…")
if (r$status_code == 200) {
    print(paste0("…✅lakeFS credentials verified. ℹ️lakeFS version ",content(r)$version))   
} else {
    print("🛑 failed to get lakeFS version")
    print(content(r)$message)
}

[1] "Verifying lakeFS credentials…"
[1] "…✅lakeFS credentials verified. ℹ️lakeFS version 0.104.0"
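
The same status check recurs after most API calls in this notebook, so it may be worth factoring out. The sketch below works on a plain status code so that it can be tried without a lakeFS server; `lakefs_status_message()` is a hypothetical helper, and the error text in the second call is made up for illustration.

```r
# Hypothetical helper: turn an HTTP status code (and an optional error message)
# into the kind of one-line report printed throughout this notebook.
lakefs_status_message <- function(status_code, message = NULL) {
    if (status_code < 400) {
        paste0("lakeFS API call succeeded (", status_code, ")")
    } else {
        paste0("lakeFS API call failed: ", status_code,
               if (!is.null(message)) paste0(" - ", message) else "")
    }
}

print(lakefs_status_message(201))
print(lakefs_status_message(404, "illustrative error text"))
```

With a real `httr` response `r`, this could be called as `lakefs_status_message(r$status_code, content(r)$message)`.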


### Define lakeFS Repository

In [9]:
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name), authenticate(lakefsAccessKey, lakefsSecretKey))

In [10]:
if (r$status_code ==404) {
    print(paste0("Repository ",repo_name," does not exist, so going to try and create it now."))

    body=list(name=repo_name, storage_namespace=paste0(storageNamespace,"/",repo_name))

    r=POST(url=paste0(lakefs_api_url,"/repositories"), 
           authenticate(lakefsAccessKey, lakefsSecretKey),
           body=body, encode="json" )

    if (r$status_code <400) {
        print(paste0("🟢 Created new repo ",repo_name," using storage namespace ",content(r)$storage_namespace))
    } else {
        print(paste0("🔴 Failed to create new repo: ",r$status_code))
        print(content(r)$message)
    }
    
} else if (r$status_code == 201 || r$status_code == 200) {
    print(paste0("Found existing repo ",repo_name," using storage namespace ",content(r)$storage_namespace))
} else {
    print(paste0("🔴 lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
    print(r)
}

[1] "Repository using-r-with-lakefs does not exist, so going to try and create it now."
[1] "🟢 Created new repo using-r-with-lakefs using storage namespace s3://example/using-r-with-lakefs"
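
The decision logic in the cell above can be read as a small mapping from status code to action. The following sketch separates that logic from the HTTP calls; both function names are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: decide what to do based on the status code returned by
# GET /repositories/{repo_name}
repo_action <- function(status_code) {
    if (status_code == 404) {
        "create"   # repository is missing, so try to create it
    } else if (status_code %in% c(200, 201)) {
        "exists"   # repository was found
    } else {
        "error"    # anything else is treated as a failed API call
    }
}

# The storage namespace for a new repository is built as in the cell above
repo_storage_namespace <- function(storage_namespace, repo_name) {
    paste0(storage_namespace, "/", repo_name)
}

print(repo_action(404))
print(repo_storage_namespace("s3://example", "using-r-with-lakefs"))
```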


---

## Main demo starts here 🚦 👇🏻

### Load NYC Film Permits data from JSON

In [11]:
library(jsonlite)

In [12]:
nyc_data <- fromJSON("/data/nyc_film_permits.json")

### Show a sample of the data

In [13]:
str(nyc_data)

'data.frame':	1000 obs. of  14 variables:
 $ eventid         : chr  "691875" "691797" "691774" "691762" ...
 $ eventtype       : chr  "Shooting Permit" "Shooting Permit" "Shooting Permit" "Shooting Permit" ...
 $ startdatetime   : chr  "2023-01-20T06:00:00.000" "2023-01-20T09:00:00.000" "2023-01-20T11:30:00.000" "2023-01-20T02:30:00.000" ...
 $ enddatetime     : chr  "2023-01-20T22:00:00.000" "2023-01-21T01:00:00.000" "2023-01-21T01:00:00.000" "2023-01-20T23:00:00.000" ...
 $ enteredon       : chr  "2023-01-18T14:34:06.000" "2023-01-18T11:48:09.000" "2023-01-18T10:47:25.000" "2023-01-18T09:57:45.000" ...
 $ eventagency     : chr  "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" "Mayor's Office of Film, Theatre & Broadcasting" ...
 $ parkingheld     : chr  "31 STREET between 47 AVENUE and 48 AVENUE" "3 AVENUE between BROOK AVENUE and EAST  162 STREET,  BROOK AVENUE between 3 AVENUE and EAST

In [14]:
table(nyc_data$borough)


        Bronx      Brooklyn     Manhattan        Queens Staten Island 
           28           334           463           168             7 
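
To see the split as percentages rather than raw counts, something like the following should work. The counts are copied by hand from the output above so that the sketch runs without the JSON file.

```r
# Borough counts copied from the table() output above
borough_counts <- c(Bronx = 28, Brooklyn = 334, Manhattan = 463,
                    Queens = 168, `Staten Island` = 7)

# Share of permits per borough, as a percentage rounded to one decimal place
borough_share <- round(100 * prop.table(borough_counts), 1)
print(borough_share)
```

With the real data loaded, the equivalent would be `round(100 * prop.table(table(nyc_data$borough)), 1)`.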

### Write the data to `main` branch (using `aws.s3`)

In [15]:
branch <- "main"
aws.s3::s3saveRDS(x = nyc_data,
                  object = paste0(branch,"/nyc/","nyc_permits.R"), 
                  bucket = repo_name, 
                  region="",
                  use_https=useHTTPS)
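
Note how the lakeFS repository and branch map onto S3 concepts in the call above: the repository is the bucket, and the branch is the first element of the object key. The sketch below spells out that mapping, along with the single-string form used for the Arrow `S3FileSystem` later in this notebook; `lakefs_paths()` is a hypothetical helper.

```r
# Hypothetical helper: show how repository, branch and path are addressed
# through the S3 gateway by each library used in this notebook.
lakefs_paths <- function(repo, branch, path) {
    list(
        bucket     = repo,                                  # aws.s3: bucket argument
        key        = paste0(branch, "/", path),             # aws.s3: object argument
        arrow_path = paste(repo, branch, path, sep = "/")   # arrow: path on the S3FileSystem
    )
}

p <- lakefs_paths("using-r-with-lakefs", "main", "nyc/nyc_permits.R")
str(p)
```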

#### List uncommitted changes on `main`

In [16]:
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/diff"), 
       authenticate(lakefsAccessKey, lakefsSecretKey))

In [17]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    str((content(r)$results))
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

[1] "👏🏻 lakeFS API call succeeded (200)"
List of 1
 $ :List of 4
  ..$ path      : chr "nyc/nyc_permits.R"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 51802
  ..$ type      : chr "added"


#### Commit the data to `main`

In [18]:
body=list(message="Initial data load", 
          metadata=list(
              client="httr", author="rmoff"))

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [19]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

[1] "👏🏻 lakeFS API call succeeded (201)"
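
Since commits are made more than once in this notebook, the request can be wrapped in a function. This is a sketch only: `commit_body()` and `lakefs_commit()` are hypothetical names, and converting metadata values to strings reflects an assumption that the API expects string values.

```r
# Hypothetical helper: build the commit request body from a message and
# optional key = value metadata pairs.
commit_body <- function(message, ...) {
    metadata <- list(...)
    body <- list(message = message)
    # Assumption: metadata values should be sent as strings; omit the field when empty
    if (length(metadata) > 0) body$metadata <- lapply(metadata, as.character)
    body
}

# Hypothetical wrapper around the POST used above (defined here, not called)
lakefs_commit <- function(api_url, repo, branch, access_key, secret_key, message, ...) {
    httr::POST(url = paste0(api_url, "/repositories/", repo, "/branches/", branch, "/commits"),
               httr::authenticate(access_key, secret_key),
               body = commit_body(message, ...), encode = "json")
}

example_body <- commit_body("Initial data load", client = "httr", author = "rmoff")
str(example_body)
```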


### Create a new branch on which to experiment with the data

In [20]:
branch <- "dev"

In [21]:
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=list(name=branch, source="main"), 
       encode="json" )

In [22]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

[1] "👏🏻 lakeFS API call succeeded (201)"


### Read the data from the `dev` branch to confirm that it matches `main`

In [23]:
nyc_data_dev <- aws.s3::s3readRDS(object = paste0(branch,"/nyc/","nyc_permits.R"), 
                                  bucket = repo_name, 
                                  region="",
                                  use_https=useHTTPS)

In [24]:
table(nyc_data_dev$borough)


        Bronx      Brooklyn     Manhattan        Queens Staten Island 
           28           334           463           168             7 

### Delete some of the data

In [25]:
nyc_data_dev <- nyc_data_dev[nyc_data_dev$borough != "Manhattan", ]

In [26]:
table(nyc_data_dev$borough)


        Bronx      Brooklyn        Queens Staten Island 
           28           334           168             7 
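
One thing to watch with this style of filter: comparing `NA` with `!=` returns `NA`, and indexing a data frame with `NA` produces a row of missing values. The sample data above has no missing boroughs (the five counts add up to all 1000 rows), so the filter is safe here. The sketch below uses a made-up four-row data frame to show the difference.

```r
# Toy data frame (not the real permits data) with one missing borough
permits <- data.frame(
    eventid = c("1", "2", "3", "4"),
    borough = c("Manhattan", "Queens", NA, "Bronx"),
    stringsAsFactors = FALSE
)

# Same filter as above: the NA comparison yields NA, which produces an all-NA row
naive <- permits[permits$borough != "Manhattan", ]

# Keep rows whose borough is missing and drop only the Manhattan rows
kept <- permits[is.na(permits$borough) | permits$borough != "Manhattan", ]

print(naive)
print(kept)
```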

### Write it back to the object store in Parquet format (using `arrow`)

In [27]:
write_parquet(x = nyc_data_dev,
              sink = lakefs$path(paste0(repo_name, "/", branch , "/nyc/nyc_permits.parquet")))

#### Remove the RDS file

In [28]:
lakefs$DeleteFile(paste0(repo_name, "/", branch , "/nyc/nyc_permits.R"))

#### Show uncommitted changes

In [29]:
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/diff"), 
       authenticate(lakefsAccessKey, lakefsSecretKey))

In [30]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    str((content(r)$results))
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

[1] "👏🏻 lakeFS API call succeeded (200)"
List of 3
 $ :List of 4
  ..$ path      : chr "nyc/"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 48278
  ..$ type      : chr "added"
 $ :List of 4
  ..$ path      : chr "nyc/nyc_permits.R"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 48278
  ..$ type      : chr "removed"
 $ :List of 4
  ..$ path      : chr "nyc/nyc_permits.parquet"
  ..$ path_type : chr "object"
  ..$ size_bytes: int 48278
  ..$ type      : chr "added"
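
The nested list printed by `str()` can be flattened into a data frame for easier reading. The sketch below rebuilds the three entries from the output above by hand so that it runs without a lakeFS server; `diff_summary()` is a hypothetical helper.

```r
# Diff entries copied from the output above, in the shape of content(r)$results
results <- list(
    list(path = "nyc/", path_type = "object", size_bytes = 48278L, type = "added"),
    list(path = "nyc/nyc_permits.R", path_type = "object", size_bytes = 48278L, type = "removed"),
    list(path = "nyc/nyc_permits.parquet", path_type = "object", size_bytes = 48278L, type = "added")
)

# Hypothetical helper: one row per changed path, with the kind of change
diff_summary <- function(results) {
    data.frame(
        path = vapply(results, function(x) x$path, character(1)),
        type = vapply(results, function(x) x$type, character(1)),
        stringsAsFactors = FALSE
    )
}

changes <- diff_summary(results)
print(changes)
print(table(changes$type))
```

Against a live response, the call would be `diff_summary(content(r)$results)`.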


### Show that the `main` view of the data is unchanged

In [31]:
branch <- "main"
lakefs$ls(path = paste0(repo_name,"/",branch),
          recursive = TRUE)

In [32]:
nyc_data <- aws.s3::s3readRDS(object = paste0(branch,"/nyc/","nyc_permits.R"), 
                              bucket = repo_name, 
                              region="",
                              use_https=useHTTPS)

table(nyc_data$borough)


        Bronx      Brooklyn     Manhattan        Queens Staten Island 
           28           334           463           168             7 

### Commit the changes to the `dev` branch

In [33]:
branch <- "dev"

body=list(message="remove data for Manhattan, write as parquet, remove original file", 
          metadata=list(
              client="httr", author="rmoff"))

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [34]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

[1] "👏🏻 lakeFS API call succeeded (201)"


### Merge the branch into `main`

In [35]:
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/refs/",branch,"/merge/main"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=list(message="merge changes from dev back to main branch"), encode="json" )

In [36]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

[1] "👏🏻 lakeFS API call succeeded (200)"
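
The order of the two references in the merge URL matters: the source comes first and the destination branch second. A small URL builder makes that explicit; `lakefs_merge_url()` is a hypothetical helper, and the argument names are chosen here for readability.

```r
# Hypothetical helper: build the merge endpoint used above.
# source_ref is merged INTO destination_branch.
lakefs_merge_url <- function(api_url, repo, source_ref, destination_branch) {
    paste0(api_url, "/repositories/", repo,
           "/refs/", source_ref, "/merge/", destination_branch)
}

merge_url <- lakefs_merge_url("http://lakefs:8000/api/v1",
                              "using-r-with-lakefs", "dev", "main")
print(merge_url)
```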


### Show that the `main` view of the data has now changed

In [37]:
branch <- "main"
nyc_data <- read_parquet(lakefs$path(paste0(repo_name, "/", branch , "/nyc/nyc_permits.parquet")))

In [38]:
table(nyc_data$borough)


        Bronx      Brooklyn        Queens Staten Island 
           28           334           168             7 
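
As a final sanity check, the borough counts on `main` before and after the merge can be compared directly. The sketch below uses the counts printed earlier in this notebook, copied by hand, rather than re-reading the data.

```r
# Counts copied from the table() outputs for main before and after the merge
before <- c(Bronx = 28, Brooklyn = 334, Manhattan = 463, Queens = 168, `Staten Island` = 7)
after  <- c(Bronx = 28, Brooklyn = 334, Queens = 168, `Staten Island` = 7)

dropped      <- setdiff(names(before), names(after))   # boroughs no longer present
unchanged    <- all(before[names(after)] == after)     # remaining counts are identical
rows_removed <- sum(before) - sum(after)               # number of rows that disappeared

print(dropped)
print(unchanged)
print(rows_removed)
```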