# Wrangling Data from Laboratory Reports - Google Colab Session

Google Colab gives you the opportunity to try your hand at Python coding. Colab comes with Python and a variety of packages preinstalled.

I am using GitHub to store the Notebook (this document) and the necessary data files. If you got this far, you already have a Colab copy of the notebook. Now you need to **click the arrow in the next cell** to make a temporary copy of my entire GitHub repository, including the data files that we want. (You can also run a code cell by placing the cursor in the cell and pressing "Ctrl-Enter".)

In [None]:
!git clone "https://github.com/dowes48/LabReports"

After running the previous cell, click on the directory icon located in the panel to the left of this screen. You will see a new directory titled "LabReports". Open the LabReports directory, then the AbLab_Rpts subdirectory. Under its subdirectories, you will find the target files, all with "prn" file extensions. Double-click to open one of the prn files. You will see its contents in a panel to the right of the Notebook. Unfortunately, this view is of limited use because it stops at the first form feed (FF) character.
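
Because the file viewer stops at the first form feed, it can help to inspect a file from code instead. The following is a minimal sketch, assuming the reports inside a prn file are separated by the form feed character (`\f`, ASCII 12). The helper name `split_reports` and the sample text are hypothetical and are not taken from the lab files.

```python
FORM_FEED = "\f"  # ASCII 12, the page-break character


def split_reports(text):
    """Split raw prn text into one string per report, dropping empty pages."""
    pages = text.split(FORM_FEED)
    return [page for page in pages if page.strip()]


# Made-up sample standing in for the contents of one prn file
sample = "Report A\nline 2\n\fReport B\nline 2\n\f"
reports = split_reports(sample)
print(f"Found {len(reports)} reports")
```

To try this on a real file, read it first with something like `open(path, errors="replace").read()`. The encoding of the prn files is not stated here, so `errors="replace"` is only a cautious guess.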

This notebook is ephemeral, as are the cloned repository files. They will be deleted sometime after you've finished. If you run the previous cell again later, you may get an error message saying that the destination path already exists. This only means that Google has not yet deleted the files.
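
If you would rather avoid that error, one option is to clone only when the directory is missing. This is a sketch under assumptions, not part of the original notebook: the helper name `clone_if_missing` and its `run` parameter are hypothetical, and it relies on `git` being available in the Colab runtime (which the clone cell above already does).

```python
import os
import subprocess

REPO_URL = "https://github.com/dowes48/LabReports"


def clone_if_missing(url=REPO_URL, dest="LabReports", run=subprocess.run):
    """Clone url into dest unless dest already exists; return True if cloned."""
    if os.path.isdir(dest):
        print(f"{dest} already exists, skipping clone")
        return False
    # check=True raises CalledProcessError if git reports a failure
    run(["git", "clone", url, dest], check=True)
    return True
```

Calling `clone_if_missing()` in a code cell should then be safe to repeat within the same session.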



## Overall Strategy

We need a program that visits each of the subdirectories, opens each prn file, and processes the file line by line while storing field values in a buffer. Since most of the prn files contain multiple lab reports, the program will need to identify form feed characters, flush the buffer to a pipe-delimited text file, and then start on the next report.
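
The buffer-and-flush idea can be sketched roughly as follows. This is only an outline under assumptions: `extract_fields` is a hypothetical placeholder (the real field parsing depends on the report layout, which has not been covered yet), `process_lines` is a made-up name, and the sample lines are invented rather than taken from the lab files.

```python
import io


def extract_fields(line):
    """Hypothetical placeholder: keep each non-blank line as a single field."""
    value = line.strip()
    return [value] if value else []


def process_lines(lines, out):
    """Buffer field values per report and write one pipe-delimited row per report."""
    buffer = []
    reports_written = 0

    def flush():
        nonlocal reports_written
        if buffer:
            out.write("|".join(buffer) + "\n")
            buffer.clear()
            reports_written += 1

    for line in lines:
        # A form feed may sit anywhere in a line, so split on it
        parts = line.split("\f")
        buffer.extend(extract_fields(parts[0]))
        for part in parts[1:]:
            flush()  # form feed seen: the current report is complete
            buffer.extend(extract_fields(part))
    flush()  # write the last report, which may not end with a form feed
    return reports_written


# Invented sample: two reports separated by a form feed
sample = ["Name: A\n", "Result: 1.2\n", "\fName: B\n", "Result: 3.4\n"]
out = io.StringIO()
count = process_lines(sample, out)
print(out.getvalue())
```

An `io.StringIO` stands in for the output file here. Passing a file opened with `open(path, "w")` should work the same way, since only `write` is used.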

I prefer pipes (|) over commas because they are easier for me to read, and a stray pipe in the text is far less likely than a stray comma to interfere with importing the pipe-separated values (psv).
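
As a small illustration, Python's standard `csv` module can write and read pipe-separated values when given `delimiter="|"`. The column names and rows below are made up for the example. With the default quoting, a comma inside a field needs no special treatment, and a field that does contain a pipe is quoted automatically.

```python
import csv
import io

# Made-up rows for illustration only
rows = [
    ["report_id", "comment"],
    ["1", "clear, no growth"],  # the comma is ordinary text in a psv file
    ["2", "stray | pipe"],      # csv wraps this field in quotes when writing
]

buf = io.StringIO()
csv.writer(buf, delimiter="|", lineterminator="\n").writerows(rows)
psv_text = buf.getvalue()
print(psv_text)

# Reading the text back with the same delimiter recovers the original rows
read_back = list(csv.reader(io.StringIO(psv_text), delimiter="|"))
```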

### First Steps

Let's start by practicing a "**walk**" through the target directory and its subdirectories. This simple exercise does nothing but verify that we have a systematic way to visit each file. The output will be a listing of each directory and each file name in that directory. When the output is no longer needed, you can right-click the output cell and choose "Clear Output".

In [3]:
import os

TARGETDIR = r'/content/LabReports/AbLab_Rpts'

start_directory = TARGETDIR  # Start from the target directory

print(f"Walking through directory: {os.path.abspath(start_directory)}\n")
# Iterate over the 3-tuples generated by os.walk()
for dirpath, dirnames, filenames in os.walk(start_directory):
    # Print the current directory path
    print(f"Current Directory: {dirpath}")

    # List subdirectories found
    if dirnames:
        print(f"  Subdirectories: {', '.join(dirnames)}")

    # Iterate over files in the current directory
    for filename in filenames:
        # Construct and print the full path of each file
        full_file_path = os.path.join(dirpath, filename)
        print(f"  Found file: {full_file_path}")
    print("-" * 40)


Walking through directory: /content/LabReports/AbLab_Rpts

Current Directory: /content/LabReports/AbLab_Rpts
  Subdirectories: AbLab_2018-20, AbLab_2022, AbLab_2024, AbLab_2021, AbLab_2023
----------------------------------------
Current Directory: /content/LabReports/AbLab_Rpts/AbLab_2018-20
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2018-20/AbLabs_20180914.prn
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2018-20/AbLabs_20190621.prn
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2018-20/AbLabs_20200522.prn
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2018-20/AbLabs_20200422.prn
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2018-20/AbLabs_20201002.prn
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2018-20/AbLabs_20180316.prn
----------------------------------------
Current Directory: /content/LabReports/AbLab_Rpts/AbLab_2022
  Found file: /content/LabReports/AbLab_Rpts/AbLab_2022/AbLabs_20220829.prn
  Found file: /content/LabReports/AbLab_Rpts/AbLab_202