<a href="https://colab.research.google.com/github/dowes48/LabRep_R/blob/main/R_version_AbLabs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wrangling Data from Laboratory Reports - Google Colab R Session

> *Let me remind the reader that Abalone Labs and Gottagetta Life are mythical entities. The names, identifiers, and test results in the AbLab print files were all fabricated from random values.*


When you run one of the following Colab code cells for the first time, Google will warn you that this is not a Google notebook. Choose the option "Run Anyway".

For example, **Click on the arrow in the next cell** to determine what version of R Google is using (along with other details), then choose the option "Run Anyway".  

*You can also run a code cell by placing the cursor in the cell and pressing "Shift-Enter".*

In [None]:
R.version

I am using GitHub to store this Notebook and its associated data files. If you got this far, you already have a Colab copy of my notebook. What you need to do next is click the arrow in the next cell to make a temporary copy of my entire GitHub repository, including the data files that we will want to use.

In [None]:
system("git clone https://github.com/dowes48/LabRep_R")

After running the previous cell, click on the directory icon located in the panel to the left of this Notebook page. You will see a new directory titled "LabRep_R". Open the LabRep_R directory, then the AbLab_Rpts subdirectory. Under its subdirectories, you will find the target files, all with ".prn" file extensions. Double-click to open your choice of .prn files. You will see the file contents in a panel to the right of the page.

---
Note: this notebook is ephemeral, as are the cloned repository files. They will be deleted sometime after you have finished using the Colab notebook.

We will also need functions from the "tidyverse" package.

In [None]:
install.packages("tidyverse")

### First Steps

Let's start by creating a list of all ".prn" files in the target directory and its sub-directories. This simple exercise does nothing but verify we have a systematic way to acquire each file for opening later on. The output will be a listing of each directory and each file name in that directory.

---

**Before running the following code**, place your cursor in the code cell and locate the cell's context menu. All Colab cells have a context menu located at the top right corner of the cell. "Context" means the menu choices are specific to the cell type. Choose **Explain Code** either from the three-dot drop-down menu or by clicking the pencil icon. Gemini, Google's branded AI, **may** generate a thorough explanation of the code in the right pane. I say **may** because sometimes Gemini returns a lazy summary without helpful details.

---

Now run the cell. Note that unlike "walking" the directory in Python, here R returns just a character vector of file paths (relative to the target directory) - but that is all we will need.

After you are finished viewing the verbose output from this code cell, you may want to clear it by choosing "**Clear Output**" from the context menu. The "**Clear all outputs**" option in the Edit menu (upper left) will do exactly that: clear all outputs.

In [None]:
rm(list=ls())
library(stringr)

TARGETDIR <- "LabRep_R/AbLab_Rpts"
in_pattern <- "AbLabs_.+\\.prn"

file_list <- list.files(path=TARGETDIR, pattern=in_pattern, recursive=TRUE)
for(full_name in file_list) {
  print(full_name)
}

Note the variable "TARGETDIR" above. Since I won't be changing its value, it is essentially a constant. I use all caps for such variables.
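
A small aside on the paths that `list.files()` returns. The sketch below builds a throwaway directory (the names `demo_dir` and `AbLabs_demo.prn` are made up for illustration, not part of the repository) and compares the default relative paths with the result of `full.names = TRUE`. The trailing `$` in the pattern is my own addition so that only names ending in ".prn" match.

```r
# Build a tiny throwaway directory tree (hypothetical names, for illustration only)
demo_dir <- file.path(tempdir(), "demo_rpts")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "sub", "AbLabs_demo.prn"))

# Default: paths are relative to the directory that was searched
rel <- list.files(path = demo_dir, pattern = "AbLabs_.+\\.prn$", recursive = TRUE)

# full.names = TRUE: the searched directory is prepended to each path
full <- list.files(path = demo_dir, pattern = "AbLabs_.+\\.prn$",
                   recursive = TRUE, full.names = TRUE)

print(rel)
print(full)
```

With relative paths, each name has to be joined to the target directory (for example with `file.path()`) before the file can be opened. With `full.names = TRUE` that step is not needed.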

---

The above code demonstrated how to traverse the directory tree and create a list of the file pathnames. Now let's open each file and "do something", but keep it simple for now. We will take advantage of the fact that each lab report is followed by a form feed character, \f. Counting form feeds will tell us how many reports there are in the .prn files.

In [15]:
rm(list=ls())
library(stringr)

TARGETDIR <- "LabRep_R/AbLab_Rpts"
in_pattern <- "AbLabs_.+\\.prn"

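form_feed <- '\f'        # each report ends with a form feed character
form_feed_count <- 0
line_count <- 0

# Tally lines and form feeds; `<<-` updates the global counters defined above.
process_lines <- function(lines){
    for (line in lines) {
        line_count <<- line_count + 1
        # counts lines that contain a form feed (at most one count per line)
        if(str_detect(line,form_feed)) {
            form_feed_count <<- form_feed_count + 1
        }
    }
}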

file_list <- list.files(path=TARGETDIR, pattern=in_pattern, recursive=TRUE)
for(fname in file_list) {
    in_file <- file(file.path(TARGETDIR,fname), "r")
    in_lines_list <- readLines(in_file)
    close(in_file)
    process_lines(in_lines_list)
}

print(line_count)
print(form_feed_count)
# print(in_lines_list)


[1] 1755160
[1] 43837


The line and form feed counts agree with the counts obtained with Python code. However, the R script required 53 seconds to complete its mission versus 1 second in Python.
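
One possible contributor to the R runtime is the line-by-line loop, which updates global counters with `<<-` once per line. The sketch below is a vectorized alternative that uses only base R. The helper name `count_file` and the demo file contents are hypothetical, and no timing is claimed for it here. Like the loop above, it counts lines that contain a form feed.

```r
# Hypothetical helper: count lines and form-feed lines in one file, without a loop
count_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  c(lines = length(lines),
    form_feeds = sum(grepl("\f", lines, fixed = TRUE)))  # TRUE counts as 1
}

# Self-contained demo on a temporary file with made-up contents
demo_path <- tempfile(fileext = ".prn")
writeLines(c("report 1, line 1",
             "report 1, line 2\f",
             "report 2, line 1\f"), demo_path)

demo_counts <- count_file(demo_path)
print(demo_counts)
```

To total across all the reports, one could apply the helper to every path and sum the rows, for example `rowSums(vapply(file.path(TARGETDIR, file_list), count_file, integer(2)))`, assuming `TARGETDIR` and `file_list` are defined as in the cell above.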