# Wrangling Data from Laboratory Reports - Google Colab Session

Google Colab gives you the opportunity to try your hand at Python coding. It comes with Python and a variety of packages already installed.

When you run the code in this notebook for the first time, Google will warn you that the notebook was not authored by Google. Choose the option "Run Anyway".

I am using GitHub to store the notebook (this document) and the necessary data files. If you got this far, you already have a Colab copy of the notebook. Now you need to **click the arrow in the next cell** to make a temporary copy of my entire GitHub repository, including the data files that we want. (You can also run a code cell by placing the cursor in the cell and pressing "Shift-Enter".)

In [None]:
!git clone "https://github.com/dowes48/LabReports"

After running the previous cell, click on the directory icon located in the panel to the left of this screen. You will see a new directory titled "LabReports". Open the LabReports directory, then the AbLab_Rpts subdirectory. Under its subdirectories, you will find the target files, all with "prn" file extensions. Double-click to open one of the prn files. You will see its contents in a panel to the right of the Notebook. Unfortunately, this view is of limited use because it stops at the first form feed (FF) character.

This notebook is ephemeral, as are the cloned repository files. They will be deleted sometime after you've finished. If you run the previous cell again later, you may get an error message saying that the destination path "LabReports" already exists. This only means that Google has not yet deleted the files.
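One way to sidestep that error on a re-run is to clone only when the directory is not already there. The following is a minimal sketch, not part of the original notebook: the helper name `clone_if_missing` and its `runner` parameter are hypothetical, and it assumes that `git` is available and that the clone should land in /content/LabReports, as the paths used later in this notebook suggest.

```python
import os
import subprocess

REPO_URL = "https://github.com/dowes48/LabReports"

def clone_if_missing(repo_url, dest_dir, runner=subprocess.run):
    """Clone repo_url into dest_dir unless dest_dir already exists.

    Returns True if a clone was attempted, False if it was skipped.
    """
    if os.path.isdir(dest_dir):
        print(f"{dest_dir} already exists; skipping the clone.")
        return False
    # check=True raises an exception if git reports a failure
    runner(["git", "clone", repo_url, dest_dir], check=True)
    return True

# Uncomment to run in Colab (requires network access):
# clone_if_missing(REPO_URL, "/content/LabReports")
```

The `runner` parameter exists only so the function can be tried out without actually contacting GitHub.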



## Overall Strategy

We need a program that visits each of the subdirectories, opens each prn file, and then processes the file line by line while storing field values in a buffer. Since there are multiple lab reports in most of the prn files, the program will need to identify form feed characters, flush the buffer to a pipe-delimited text file, and then start on the next report.
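The buffer-and-flush idea can be sketched on its own before any files are involved. This is only an illustration under stated assumptions: the function name `iter_reports` is hypothetical, the sample lines are made up, and the form feed is assumed to sit on a line of its own (anything else on that line is ignored).

```python
def iter_reports(lines):
    """Yield one list of lines per report, splitting on form feeds."""
    buffer = []
    for line in lines:
        if "\f" in line:
            # A form feed marks the end of the current report: flush the buffer.
            yield buffer
            buffer = []
        else:
            buffer.append(line.rstrip("\n"))
    # Lines after the last form feed (if any) form a trailing partial report.
    if buffer:
        yield buffer

# Made-up lines standing in for the contents of a prn file
sample = ["NAME: A\n", "GLUCOSE 90\n", "\f\n", "NAME: B\n", "\f\n"]
reports = list(iter_reports(sample))
print(len(reports))
```

In the real program, "flushing" means writing one pipe-delimited row to the output file rather than yielding a list.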

> *Let me remind the reader that Abalone Labs and Gottagetta Life are mythical entities; the names and identifiers for all individuals are fabricated from random values, as are the lab test results. The processing of these files is useful only for teaching purposes - any analysis of the results would be meaningless.*

I prefer pipes (|) over commas because they are easier for me to read and it is far less likely for a stray pipe in the text to interfere with importing the pipe separated values (psv).
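To see why a pipe is the safer delimiter here, consider values that contain commas. The sketch below uses Python's standard `csv` module with `delimiter="|"`; that is a swapped-in technique for illustration only (the code later in this notebook builds rows with `"|".join()`), and the values are made up.

```python
import csv
import io

rows = [
    ["name", "note"],
    ["DOE, JANE", "fasting, repeat in 6 mo"],  # made-up values containing commas
]

# Write the rows as pipe-separated values into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerows(rows)
psv_text = buffer.getvalue()
print(psv_text)

# Reading the text back with the same delimiter recovers the original fields,
# commas and all.
parsed = list(csv.reader(io.StringIO(psv_text), delimiter="|"))
```

With a comma delimiter, the same fields would have to be quoted to survive a round trip.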

### First Steps

Let's start by practicing a "**walk**" through the target directory and its sub-directories. This simple exercise does nothing but verify we have a systematic way to visit each file. The output will be a listing of each directory and each file name in that directory. After viewing the output, choose "Clear Output" from the context menu located at the top right edge of the cell.

When the cursor is in a code cell (like the one that follows), the context menu offers a very helpful choice, "Explain Code". The AI-generated explanations can be quite useful.

In [None]:
TARGETDIR = '/content/LabReports/AbLab_Rpts'
#TARGETDIR = './AbLab_Rpts'
import os

print(f"Walking through directory: {os.path.abspath(TARGETDIR)}\n")
# Iterate over the 3-tuples yielded by os.walk(), one per directory
for dirpath, dirnames, filenames in os.walk(TARGETDIR):
    # Print the current directory path
    print(f"Current Directory: {dirpath}")

    # List subdirectories found
    if dirnames:
        print(f"  Subdirectories: {', '.join(dirnames)}")

    # Iterate over files in the current directory
    for filename in filenames:
        # Construct and print the full path of each file
        full_file_path = os.path.join(dirpath, filename)
        print(f"  Found file: {full_file_path}")
    print("-" * 40)


A few notes on the code above:

*   Note the variable "TARGETDIR". Since I won't be changing its value, it is essentially a constant. I use all caps for such variables.
*   Also note the two *for* loops. There are other loops available in Python, but the construct *for xx in yy* is the most Pythonic.
*   The *os.walk()* function yields one three-tuple (directory path, subdirectory names, file names) for each directory it visits. See Prof Downey's text for an explanation of tuples.
*   The "f" prefix indicates a formatted string literal and allows for easy interpolation of variable values in the printed string.
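Tuple unpacking and f-strings can be tried in isolation. The sketch below builds a tiny throwaway directory tree with `tempfile` so that it does not depend on the cloned repository; the directory and file names are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create root/sub/demo.prn as a stand-in for the real data
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "demo.prn"), "w") as f:
        f.write("placeholder\n")

    walked = []
    for entry in os.walk(root):
        # Each entry is a plain 3-tuple; unpack it into named parts.
        dirpath, dirnames, filenames = entry
        walked.append((os.path.relpath(dirpath, root), dirnames, filenames))
        # The ":," format spec adds thousands separators to the number.
        print(f"{len(filenames):,} file(s) in {dirpath}")
```

Writing `for dirpath, dirnames, filenames in os.walk(...)` does the same unpacking in one step.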

The above code demonstrates how to traverse the directory tree and touch each file. Now let's open each file and "do something", but keep it simple for now. We will take advantage of the fact that each lab report is followed by a form feed character, "\f". Counting form feeds will tell us how many reports to expect, so this is a useful exercise.

Note that I do not need to re-state the TARGETDIR or import os again. This notebook has access to those values from the previous code cell.

In [3]:
form_feed = '\f'
form_feed_count = 0
line_count = 0
file_count = 0

def process_line(f_in):
    global line_count, form_feed_count
    for line in f_in:
        line_count += 1
        if form_feed in line:
            form_feed_count += 1

for dirpath, dirnames, filenames in os.walk(TARGETDIR):
    for filename in filenames:
        file_count += 1
        full_file_path = os.path.join(dirpath, filename)
        file_in = open(full_file_path, 'r')
        process_line(file_in)
        file_in.close()

print(f"All {file_count:,} .prn files were opened")
print(f"A total of {line_count:,} lines were searched for a form feed.")
print(f"There are {form_feed_count:,} lab test reports available for processing.")


All 334 .prn files were opened
A total of 1,755,160 lines were searched for a form feed.
There are 43,837 lab test reports available for processing.


Let's face it, those are impressive numbers. More importantly, we continue to develop a framework for our goal of capturing lab results.

The previous code cell includes two new features: 1) a custom function, process_line(), was defined; 2) the concept of "scope" was introduced. See Downey for a full explanation of function definitions.

Note that I need the "count" variables to be available at the main level in order to print their values to the screen, so they must live outside the scope of the function. Because the function also assigns to those variables (with *+=*), I need to declare them as global inside the function. Otherwise, Python would treat them as local variables that coincidentally had the same names as those outside the function, and the first *+=* would fail with an UnboundLocalError because the local variable has no value yet.
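An alternative that avoids *global* altogether is to have the function return its counts and let the main level add them up. This is a sketch of that alternative, not the approach used in this notebook; the function name `count_lines_and_form_feeds` is hypothetical, and two made-up in-memory "files" stand in for real prn files.

```python
import io

def count_lines_and_form_feeds(f_in):
    """Return (line_count, form_feed_count) for one open text file."""
    lines = 0
    form_feeds = 0
    for line in f_in:
        lines += 1
        if "\f" in line:
            form_feeds += 1
    return lines, form_feeds

# Made-up in-memory "files" standing in for real .prn files
fake_files = [io.StringIO("a\nb\n\f\n"), io.StringIO("c\n\f\nd\n\f\n")]

total_lines = 0
total_form_feeds = 0
for f_in in fake_files:
    # The counters inside the function are local; only the returned
    # values reach the main level.
    lines, form_feeds = count_lines_and_form_feeds(f_in)
    total_lines += lines
    total_form_feeds += form_feeds

print(f"{total_lines:,} lines, {total_form_feeds:,} form feeds")
```

Either style works; returning values keeps the function self-contained, while *global* keeps the calling code shorter.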

Scope will come into play again as we expand the code to incorporate:
*   an output file for collecting lab results
*   a buffering system that uses a Python dictionary to collect results from each lab report instance

This can get complicated, so it is best to proceed incrementally. Let's start by narrowing our attention to one of the sub-directories that has only a handful of prn files. Let's open a file for output, write a short series of column headers to it, and then write a small number of values before closing the output file (I try to remember to close both input and output files).
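Python's *with* statement is a common way to make sure a file gets closed without having to remember to do it. The sketch below is only an illustration (the cells in this notebook call *close()* explicitly); it writes to a throwaway directory, and the file name and values are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_output.txt")  # hypothetical file name

    # The file is closed automatically when the with block ends,
    # even if an exception is raised inside it.
    with open(out_path, "w") as f_out:
        f_out.write("name|sex|ticket|gluc\n")
        f_out.write("DOE, JANE|F|0000000001|90\n")  # made-up values

    closed_after_block = f_out.closed

    # Read the file back to confirm what was written.
    with open(out_path, "r") as f_in:
        lines_read = f_in.read().splitlines()

print(closed_after_block)
print(lines_read)
```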

The next code cell uses two new Python types: lists and dictionaries. If you are not familiar with these, please review chapters 9 and 10 in Downey.
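As a quick warm-up, here is how a dictionary can act as a one-report buffer and how its values can be joined into a pipe-delimited row. This is a minimal sketch with made-up values; the variable names are chosen for the illustration only.

```python
clean_record = {"name": "", "sex": "", "ticket": "", "gluc": ""}

# Copy the template so the original stays blank for the next report.
record = dict(clean_record)
record["name"] = "DOE, JANE"   # made-up values for illustration
record["gluc"] = "90"

# Dictionaries preserve insertion order (Python 3.7+), so the values
# come out in the same order as the column headers.
header = "|".join(record.keys())
row = "|".join(record.values())
print(header)
print(row)
```

Fields that were never filled in simply show up as empty strings between two pipes.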



In [5]:
#TARGETDIR = r'./AbLab_Rpts/AbLab_2018-20'
#OUTPUTDIR = r'./Output'
TARGETDIR = r'/content/LabReports/AbLab_Rpts/AbLab_2018-20'
OUTPUTDIR = r'/content/LabReports/Output'

import os
import copy

lab_file_out = open(OUTPUTDIR + r'/labs_text_output.txt', 'w')
# End the header with a newline so the first data row starts on its own line
lab_file_out.write('name|sex|ticket|gluc\n')

clean_lab_dict = {'name':"", 'sex':"", 'ticket':"", 'gluc':""}

def process_reports(f_in):
    lab_dict = copy.deepcopy(clean_lab_dict)
    lab_lst = []
    for line in f_in:
        if "\f" not in line:
            if 'NAME:' in line:
                lab_dict['name'] = line[6:].strip()
            if 'DOB/SEX:' in line:
                lab_dict['sex'] = line[8:].strip()
            if 'TICKET NUMBER:' in line:
                lab_dict['ticket'] = line[15:25]
            if 'GLUCOSE (MG/DL' in line:
                lab_dict['gluc'] = line[30:40].strip()
        else:
            for k, v in lab_dict.items():
                lab_lst.append(v)
            lab_file_out.write("|".join(lab_lst) + '\n')
            lab_dict = copy.deepcopy(clean_lab_dict)
            lab_lst = []

for dirpath, dirnames, filenames in os.walk(TARGETDIR):
    # Iterate over files in the current directory
    for filename in filenames:
        # Construct the full path of each file
        full_file_path = os.path.join(dirpath, filename)
        file_in = open(full_file_path, 'r')
        process_reports(file_in)
        file_in.close()

lab_file_out.close()


FileNotFoundError: [Errno 2] No such file or directory: 'content/LabReports/Output/labs_text_output.txt'
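When *open()* is asked to write a file, it will create the file but not the directory that should contain it, so a FileNotFoundError like this one usually means that some directory in the path does not exist. Note also that the path in the message has no leading slash, which means it was resolved relative to the current working directory. One possible remedy, assuming the Output directory is simply missing, is to create it before opening the file. The helper name `open_output` below is hypothetical, and the sketch runs in a throwaway directory rather than in /content.

```python
import os
import tempfile

def open_output(output_dir, file_name):
    """Create output_dir if needed, then open file_name inside it for writing."""
    os.makedirs(output_dir, exist_ok=True)   # no error if it already exists
    return open(os.path.join(output_dir, file_name), "w")

# Demonstrated in a throwaway directory; in the notebook the first argument
# would be OUTPUTDIR.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "Output")
    with open_output(out_dir, "labs_text_output.txt") as f_out:
        f_out.write("name|sex|ticket|gluc\n")
    created = os.path.isdir(out_dir)
    with open(os.path.join(out_dir, "labs_text_output.txt")) as f_in:
        first_line = f_in.readline()
```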