Welcome to chapter one of Methods in Medical Informatics! In this section, we will explore how to parse and transform text files through seven scripts, each illustrating a different aspect of the task. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Peeking into Large Files

Text data files can be quite large, sometimes gigabytes in length. Most word processors are unable to handle files this large. However, with a simple utility script we can open a large text file, extract a sample, and display that sample. The script below displays the first 40 lines of a large text file and then stores the first 3000 lines in a separate file, which can be opened with a word processor. Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the file [sample.txt](http://datamine.unc.edu/jupyter/edit/Methods-in-Medical-Informatics-master/sample.txt). This is a text file which contains the article "A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity" represented in XML. Additional information [here](https://datamine.unc.edu/data-files/)

**Description adapted from page 3 of "Methods in Medical Informatics".*

In [None]:
import string
line = input('What file would like to sample? (Please write file name) ')
infile = open(line, 'r', encoding='utf-8')
#outfile = open('data\sample_output.txt', 'w', encoding='utf-8')
for iterations in range(40):
    getline = infile.readline()
    print(getline.rstrip())
infile.seek(0)  # rewind so the saved sample starts at the first line
for iterations in range(3000):
    getline = infile.readline()
    #outfile.write(getline)
infile.close()
#outfile.close()

## Script Algorithm: Peeking into Large Files

Send a prompt to the monitor asking for the name of the file to sample. Store the file name as a variable.*

In [None]:
import string
line = input('What file would like to sample? (Please write file name) ')

Open the file for reading. Open another file for writing.

In [None]:
infile = open(line, 'r', encoding='utf-8')
#outfile = open('data\sample_output.txt', 'w', encoding='utf-8')

Create a for loop that iterates through the first 40 lines of the text file. Print each line as the script iterates through the file.

In [None]:
for iterations in range(40):
    getline = infile.readline()
    print(getline.rstrip())

Create a for loop that iterates through the first 3000 lines of the text file. Store the first 3000 lines in a separate text file. 

In [None]:
infile.seek(0)  # rewind so the saved sample starts at the first line
for iterations in range(3000):
    getline = infile.readline()
    #outfile.write(getline)

Close the reading and writing files. 

In [None]:
infile.close()
#outfile.close()

**This section is adapted from section 1.1.1, "Script Algorithm", of pages 3-4 from "Methods in Medical Informatics".*

## Analysis: Peeking into Large Files

Even simple scripts occasionally require the user to enter information via the keyboard. In this script, one line is all that is needed to initiate a conversation between the script and the user. A line of text is sent to the monitor, and the script waits until the user enters a reply and presses the Enter key. The reply is captured by the program and stored as a variable. The script then displays the first 40 lines of the document and, once the commented-out `outfile` lines are enabled, writes the first 3000 lines to the file `sample_output.txt`.*

**This section is adapted from section 1.1.2, "Analysis", of page 5 of "Methods in Medical Informatics".*
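
The sampling idea can also be written with context managers and `itertools.islice`, so the files are closed automatically and the loop stops cleanly when a file is shorter than the requested sample. This is a minimal sketch rather than the book's method; the function names `sample_lines` and `save_sample` are hypothetical, and the files are assumed to be UTF-8 text.

```python
from itertools import islice


def sample_lines(path, count, encoding="utf-8"):
    """Return the first `count` lines of a text file without loading the whole file."""
    with open(path, "r", encoding=encoding) as infile:
        # islice stops after `count` lines, or earlier if the file is shorter
        return [line.rstrip("\n") for line in islice(infile, count)]


def save_sample(path, out_path, count, encoding="utf-8"):
    """Copy the first `count` lines of `path` into `out_path`; return lines written."""
    written = 0
    with open(path, "r", encoding=encoding) as infile, \
            open(out_path, "w", encoding=encoding) as outfile:
        for line in islice(infile, count):
            outfile.write(line)
            written += 1
    return written
```

For example, `sample_lines('sample.txt', 40)` would return the lines to display, and `save_sample('sample.txt', 'sample_output.txt', 3000)` would write the larger sample.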

# Paging through Large Text Files

The script below solves the same problem as the first script but takes a different approach. It displays the first 40 lines of any text file and then offers an opportunity to quit; if the user declines, the script displays the next 40 lines, and the cycle repeats indefinitely. This provides a quick way to scroll through a file. Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the file [sample.txt](http://datamine.unc.edu/jupyter/edit/Methods-in-Medical-Informatics-master/sample.txt). This is a text file which contains the article "A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity" represented in XML. Additional information [here](https://datamine.unc.edu/data-files/)

**Description adapted from page 5 of "Methods in Medical Informatics".*

In [None]:
import string
line = input('What file would like to sample? (Please write file name) ')
infile = open(line, 'r', encoding='utf-8')
#outfile = open('data\sample_output.txt', 'w', encoding='utf-8')
while True:
    for iterations in range(40):
        print(infile.readline().rstrip())
    response = input('Type QUIT if you want to quit. Otherwise press any key\n')
    if (response == 'QUIT'):
        break
infile.close()
#outfile.close()
#exit()

## Script Algorithm: Paging through Large Text Files

Send a prompt to the monitor asking for the name of a file that you want to read.*

In [None]:
import string
line = input('What file would like to sample? (Please write file name) ')

Open a text file to read. Open another file to write.

In [None]:
infile = open(line, 'r', encoding='utf-8')
#outfile = open('data\sample_output.txt', 'w', encoding='utf-8')

Print the first 40 lines of the file. Prompt the user, asking whether he or she would like to quit the program. If the user enters "QUIT" after the prompt, exit the program. Otherwise, repeat with the next 40 lines.

In [None]:
while True:
    for iterations in range(40):
        print(infile.readline().rstrip())
    response = input('Type QUIT if you want to quit. Otherwise press any key\n')
    if (response == 'QUIT'):
        break

Close all opened files. Exit the program.

In [None]:
infile.close()
#outfile.close()
#exit()

**This section is adapted from section 1.2.1, "Script Algorithm", of pages 5-6 from "Methods in Medical Informatics".*

## Analysis: Paging through Large Text Files

If you want to try this script, be sure to provide the name of a text file at the prompt. Programming languages can open a file for reading, without loading the entire file into memory. When a file is opened for reading, file information can be accessed by sequential line readings, or by direct access to any selected byte location in the file. These operations can be done very quickly. The rate-limiting factor is the speed with which your monitor can display text.*

**This section is adapted from section 1.2.2, "Analysis", of page 7 from "Methods in Medical Informatics".*
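
The two access styles mentioned in the analysis, sequential line reads and direct access to a byte location, can be sketched as follows. The helper names `read_bytes_at` and `pages` are hypothetical, and the text file is assumed to be UTF-8. Unlike the paging script above, the `pages` generator stops at the end of the file instead of printing empty lines.

```python
def read_bytes_at(path, offset, length):
    """Jump straight to byte `offset` and read `length` bytes."""
    with open(path, "rb") as infile:
        infile.seek(offset)  # direct access: no earlier bytes are read
        return infile.read(length)


def pages(path, page_size=40, encoding="utf-8"):
    """Yield successive lists of up to `page_size` lines, stopping at end of file."""
    with open(path, "r", encoding=encoding) as infile:
        page = []
        for line in infile:  # sequential access: one line at a time
            page.append(line.rstrip("\n"))
            if len(page) == page_size:
                yield page
                page = []
        if page:  # final, partial page
            yield page
```

A paging loop could then iterate over `pages('sample.txt')`, print each page, and break when the user types `QUIT`.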

# Extracting Lines that Match a Regular Expression

It is useful to be able to search for classes of data within a large text file. Regular expressions (regex) allow you to do so. A regular expression is a conventional way of describing string patterns. The script below will extract lines of text that match a regular expression. Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the file [sample.txt](http://datamine.unc.edu/jupyter/edit/Methods-in-Medical-Informatics-master/sample.txt). This is a text file which contains the article "A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity" represented in XML. Additional information [here](https://datamine.unc.edu/data-files/)

**Description adapted from pages 7-8 of "Methods in Medical Informatics".*

In [None]:
import string
import re
line = input('What file would you like to search? (Please write file name) ')
regex = input('Enter a word, phrase or regular expression to search.')
regex = regex.rstrip()
infile = open(line, 'r', encoding='utf-8')
#outfile = open('data\sample_output.txt', 'w')
regex_object = re.compile(regex, re.I)
for line in infile:
    m = regex_object.search(line)
    if m:
        print(line)
        #outfile.write(line)
infile.close()
#exit()

## Script Algorithm: Extracting Lines that Match a Regular Expression

Send a prompt to the monitor asking for the name of a file to be searched and the regular expression to search for.*

In [None]:
line = input('What file would you like to search? (Please write file name) ')
regex = input('Enter a word, phrase or regular expression to search.')
regex = regex.rstrip()

Open a text file to read. Open a file to output. 

In [None]:
infile = open(line, 'r', encoding='utf-8')
#outfile = open('data\sample_output.txt', 'w')

Create a variable to hold the regular expression.

In [None]:
regex_object = re.compile(regex, re.I)

Parse through every line of the text file. Whenever a line matches the search expression, print it and write it to the output file.

In [None]:
for line in infile:
    m = regex_object.search(line)
    if m:
        print(line)
        #outfile.write(line)

Close the file and exit the script.

In [None]:
infile.close()
#exit()

**This section is adapted from section 1.3.1, "Script Algorithm", of page 8 from "Methods in Medical Informatics".*

## Analysis: Extracting Lines that Match a Regular Expression

When you try this script, be sure to provide the name of a text file at the prompt. If you do not know how to compose a regular expression, just enter a search word or phrase at the prompt. The script will display every line from the provided file that contains a string that matches your search word or phrase, and will send a copy of the results to an external file.*

**This section is adapted from section 1.2.2, "Analysis", of page 7 from "Methods in Medical Informatics".*
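
One caveat when a plain word or phrase is entered at the prompt: characters such as `.` or `(` are interpreted as regular expression syntax. The sketch below shows one way to handle this with `re.escape`; the function `matching_lines` and its `literal` flag are hypothetical names introduced for illustration, and the sample lines are invented.

```python
import re


def matching_lines(lines, pattern, literal=False):
    """Return the lines that contain a case-insensitive match for `pattern`."""
    if literal:
        # re.escape neutralizes regex metacharacters such as "." and "("
        pattern = re.escape(pattern)
    regex_object = re.compile(pattern, re.I)
    return [line for line in lines if regex_object.search(line)]


# Invented sample lines, used only to illustrate the behavior
text = [
    "COVID-19 inpatient capacity",
    "covid 19 testing (rapid)",
    "Machine learning algorithm",
]

# As a regex, "." matches any character, so both spellings are found
print(matching_lines(text, r"covid.19"))
# As a literal phrase, "covid.19" must contain an actual period
print(matching_lines(text, "covid.19", literal=True))
```

With `literal=True` the search behaves like a plain text search, which may be the safer default when the input comes from someone unfamiliar with regular expressions.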

# Changing Every File in a Subdirectory

String substitution is a common computational task with biomedical applications. For instance, you may want to replace every occurrence of the word "tumor" with "tumour" when submitting a manuscript to a British journal. Or perhaps a calculation, repeated throughout your quality assurance report, was incorrect, and you want to substitute the correct number wherever the incorrect number appears. The script below will parse through every file in a subdirectory and make a specific substitution for every matching sequence. Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the directory [test_directory](http://datamine.unc.edu/jupyter/tree/Methods-in-Medical-Informatics-master/Test_Directory). The directory holds a text file containing the abstract from the article “COVID-19: what has been learned and to be learned about the novel coronavirus disease”. Additional information [here](https://datamine.unc.edu/data-files/)

**Description adapted from page 10 of "Methods in Medical Informatics".*

In [None]:
import sys 
import os 
import re
filelist = os.listdir('C:\\Users\\ericr\\Documents\\GitHub\\Clinical-Cases-LAIR\\Methods in Medical Informatics\\Test_Directory')
os.chdir('C:\\Users\\ericr\\Documents\\GitHub\\Clinical-Cases-LAIR\\Methods in Medical Informatics\\Test_Directory')
for file in filelist:
    infile = open(file,'r')
    filestring = infile.read()
    infile.close()
    filestring = re.sub('COVID-19','SO LONG', filestring)
    outfile = open(file,'w')
    outfile.write(filestring)
    outfile.close()
print('Substitution Completed')
#exit()

## Script Algorithm: Changing Every File in a Subdirectory

Open a directory for reading*

In [None]:
import sys 
import os 
import re
filelist = os.listdir('C:\\Users\\ericr\\Documents\\GitHub\\Clinical-Cases-LAIR\\Methods in Medical Informatics\\Test_Directory')
os.chdir('C:\\Users\\ericr\\Documents\\GitHub\\Clinical-Cases-LAIR\\Methods in Medical Informatics\\Test_Directory')

For each file in your file list, do the following: open the file, read its contents, make the desired substitution for each matching sequence, write the result back to the file, and close the file when you're finished.

In [None]:
for file in filelist:
    infile = open(file,'r')
    filestring = infile.read()
    infile.close()
    filestring = re.sub('COVID-19','SO LONG', filestring)
    outfile = open(file,'w')
    outfile.write(filestring)
    outfile.close()
print('Substitution Completed')
#exit()

**This section is adapted from section 1.4.1, "Script Algorithm", of page 10 from "Methods in Medical Informatics".*

## Analysis: Changing Every File in a Subdirectory

Programming languages provide a simple way to determine the names of the files in a subdirectory. Once the names of the files are determined, it becomes straightforward to open files, examine their contents, and transform them. In this case, we explored the `Test_Directory` directory. Then, we parsed through each file and replaced each instance of `COVID-19` with the string `SO LONG`.

If you were writing your own multifile substitution script, you might want to change a defunct web address wherever it appears in a file, or you might want to correct a common spelling error in many files at once. Programming languages typically provide a variety of file operations, including file tests (e.g., to determine whether a file exists or whether a file is a text file or a binary file) and file stats (descriptive information such as file size, creation date, or modification date).*

**This section is adapted from section 1.4.2, "Analysis", of pages 11-12 from "Methods in Medical Informatics".*
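
The file tests and file stats mentioned above are available in Python through `os.path` and `os.stat`. The sketch below is one possible arrangement, not the book's code: `describe_files` and `replace_in_directory` are hypothetical helper names, every file is assumed to be UTF-8 text, and subdirectories are skipped so that `open` is only called on regular files.

```python
import os
import time


def describe_files(directory):
    """Return (name, size_in_bytes, modified_date) for each regular file."""
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):  # file test: skip subdirectories
            continue
        info = os.stat(path)  # file stats: size, timestamps, and more
        modified = time.strftime("%Y-%m-%d", time.localtime(info.st_mtime))
        rows.append((name, info.st_size, modified))
    return rows


def replace_in_directory(directory, old, new, encoding="utf-8"):
    """Replace the literal string `old` with `new` in every regular file.

    Returns the names of the files that were actually changed.
    """
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", encoding=encoding) as infile:
            original = infile.read()
        updated = original.replace(old, new)
        if updated != original:  # rewrite only when something changed
            with open(path, "w", encoding=encoding) as outfile:
                outfile.write(updated)
            changed.append(name)
    return changed
```

Building each path with `os.path.join` avoids the need to call `os.chdir`, and `str.replace` is used here because the search string is a literal rather than a pattern.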

# Counting the Words in a File

It is easy to write a short script that counts the words in a file, but it is difficult to do the job to everyone's liking. Depending on the type of text and the intended use of the word count, the criteria for counting a word may change. There will be occasions when you will want to write your own script that counts words just as you prefer. Here is a minimalist word-counting script for a text file from the Online Mendelian Inheritance in Man (OMIM). Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the text file [mim2gene.txt](http://datamine.unc.edu/jupyter/edit/Methods-in-Medical-Informatics-master/mim2gene.txt). This is a text file that details the links between the genes in OMIM and other gene identifiers. Additional information [here](https://datamine.unc.edu/data-files/)

**Description adapted from page 12 of "Methods in Medical Informatics".*

In [None]:
import re
import string
total = 0
line_list = []
line_reduced = []
in_text = open('mim2gene.txt', 'r')
for line in in_text:
    line_list = re.split(r'[ \n]+',line)
    line_reduced = [var for var in line_list if var != '']
    total = total + len(line_reduced)
print('Total Words in File:', total)
#exit()

## Script Algorithm: Counting the Words in a File

Open the file*

In [None]:
import re
import string
total = 0
line_list = []
line_reduced = []
in_text = open('mim2gene.txt', 'r')

Parse through file line by line. Split each line into an array. Determine the size of the array. 

In [None]:
for line in in_text:
    line_list = re.split(r'[ \n]+',line)
    line_reduced = [var for var in line_list if var != '']
    total = total + len(line_reduced)

Display the word count

In [None]:
print('Total Words in File:', total)

Exit the script

In [None]:
#exit()

**This section is adapted from section 1.5.1, "Script Algorithm", of pages 12-13 from "Methods in Medical Informatics".*

## Analysis: Counting the Words in a File

The script produces the word count for the OMIM file, currently over 28,000 words, in mere seconds.*

**This section is adapted from section 1.5.2, "Analysis", of page 14 from "Methods in Medical Informatics".*
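
How much the counting criteria matter can be seen by comparing two splitting rules on the same lines. The script above splits only on spaces and newlines, so tab-separated fields count as a single word. The sketch below contrasts that rule with splitting on any whitespace; the function names are hypothetical and the two sample lines are invented for illustration, not taken from `mim2gene.txt`.

```python
import re


def count_space_delimited(lines):
    """Count tokens separated by spaces or newlines only (tabs do not split)."""
    total = 0
    for line in lines:
        tokens = [tok for tok in re.split(r"[ \n]+", line) if tok != ""]
        total += len(tokens)
    return total


def count_whitespace_delimited(lines):
    """Count tokens separated by any whitespace, including tabs."""
    return sum(len(line.split()) for line in lines)


# Invented lines mixing tabs and spaces
sample = [
    "100050\tpredominantly phenotypes\n",
    "100070 phenotype\t100329167\n",
]

print(count_space_delimited(sample))       # 4
print(count_whitespace_delimited(sample))  # 6
```

Neither count is wrong; which one is appropriate depends on whether a tab-separated field should be treated as one word or several.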

# Making a Word List with Occurrence Tally

Sometimes you need a listing of all the different words in a document and the number of occurrences of each word. A word frequency list can tell you a lot about a document. The script below will generate a word list and corresponding occurrence tally for a text file. Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the text file [mim2gene.txt](http://datamine.unc.edu/jupyter/edit/Methods-in-Medical-Informatics-master/mim2gene.txt). This is a text file that details the links between the genes in OMIM and other gene identifiers. Additional information [here](https://datamine.unc.edu/data-files/)

**Description adapted from page 14 of "Methods in Medical Informatics".*

In [None]:
import re
import string
word_list = []
freq_list = []
freq = {}
in_text = open('mim2gene.txt', 'r')
in_text_string = in_text.read()
#out_text = open('mimgene_output.txt', 'w')
in_text_string = in_text_string.lower()
word_list = re.findall(r'(\b[a-z]{3,15}\b)', in_text_string)
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1
freq_list = freq.keys()
sort_list = sorted(freq_list)
for i in sort_list:
    print(i, freq[i])
#exit()

## Script Algorithm: Making a Word List with Occurrence Tally

Open the text file to read. Open a separate text file to write the output.*

In [None]:
import re
import string
word_list = []
freq_list = []
freq = {}
in_text = open('mim2gene.txt', 'r')
in_text_string = in_text.read()
#out_text = open('mimgene_output.txt', 'w')

Lowercase the text file

In [None]:
in_text_string = in_text_string.lower()

Create a general regex pattern to identify words

In [None]:
word_list = re.findall(r'(\b[a-z]{3,15}\b)', in_text_string)

Traverse through the text file while matching for individual words. When a match is found, assign it to a dictionary as a key. Also increment the value of the key by one each time a match is found

In [None]:
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1

Sort the dictionary keys alphabetically

In [None]:
freq_list = freq.keys()
sort_list = sorted(freq_list)

Print the sorted list

In [None]:
for i in sort_list:
    print(i, freq[i])

Exit the script

In [None]:
#exit()

**This section is adapted from section 1.6.1, "Script Algorithm", of page 14 from "Methods in Medical Informatics".*

## Analysis: Making a Word List with Occurrence Tally

We loaded the entire text of OMIM, a text file exceeding 28,000 words, into a single variable. The script executed in mere seconds, creating output containing thousands of words and the number of times each word occurred within the text file. Here is a short sample of the output:

<ul>
    <li>ywhae 1</li>
    <li>ywhag 1</li>
    <li>ywhah 1</li>
    <li>ywhaq 1</li>
    <li>ywhaz 1</li>
    <li>zacn 1</li>
    <li>zan 1</li>
    <li>zfat 1</li>
    <li>zfr 1</li>
    <li>zfx 1</li>
    <li>zfy 1</li>
    <li>zpbp 1</li>
    <li>zwilch 1</li>
    <li>zwint 1</li>
    <li>zxda 1</li>
    <li>zxdb 1</li>
    <li>zxdc 1</li>
    <li>zyx 1</li>
</ul>

Most languages contain between 20,000 and 60,000 words. Comprehensive dictionaries that contain many more than 60,000 words include all of the variant forms of a single word (e.g., soft, softer, softest, soften, softens). When we examine the OMIM output list, we see that most of the so-called words are names of people or misspellings. If you have a large file and a word occurs fewer than three times, it is unlikely to be a valid word. When we find a very high-frequency word, such as *with*, it is probably a low-information-content word used to connect other words. When we find a middle-frequency word, such as *kidney*, it is almost certainly a high-information-content word relevant to the document's knowledge domain.*

**This section is adapted from section 1.6.2, "Analysis", of page 16 from "Methods in Medical Informatics".*
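
To look at words by frequency rather than alphabetically, the same tally can be built with `collections.Counter`, which is a substitute for the hand-built dictionary used in the script above. In this sketch the function name `word_frequencies` is hypothetical and the sample sentence is invented; the regular expression is the one from the script.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Tally lowercase alphabetic words of 3 to 15 letters."""
    words = re.findall(r"\b[a-z]{3,15}\b", text.lower())
    return Counter(words)


# Invented sample text
text = "Kidney function with age. Kidney disease with diabetes, with hypertension."
freq = word_frequencies(text)

# Highest-frequency words first
print(freq.most_common(2))  # [('with', 3), ('kidney', 2)]

# Words that occur only once
rare = sorted(word for word, count in freq.items() if count < 2)
print(rare)
```

Sorting by count makes it easy to separate connecting words at the top of the list from rare strings at the bottom.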

# Using Printf Formatting Style

Printf is a programming convention that provides a simple way to specify the arrangement of data printed to an output line. Printf produces output in neat columns. The script below will create a word list with an occurrence tally for a MeSH text file. The script will also format the output using the Printf style. Afterward we will explore both the script and the script output in more detail.*

> This script will utilize the binary file [d2020.bin](http://localhost:8888/edit/Documents/GitHub/Methods-in-Medical-Informatics/d2020.bin). This is a binary file which lists current Medical Subject Headings (MeSH) as of 2020. Additional information [here](https://datamine.unc.edu/data-files/)

In [None]:
import re
import string
word_list = []
freq_list = []
freq = {}
in_text = open('d2020.bin', 'r', encoding='utf-8')
in_text_string = in_text.read()
#out_text = open('mesh_output.txt', 'w')
in_text_string = in_text_string.lower()
word_list = re.findall(r'(\b[a-z]{3,15}\b)', in_text_string)
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1
freq_list = freq.keys()
sort_list = sorted(freq_list)
for i in sort_list:
    print('%-20.20s %8.06d' % (i, freq[i]))
#exit()

**Description adapted from pages 16-17 of "Methods in Medical Informatics".*

## Script Algorithm: Using Printf Formatting Style

Open the text file to read. Open a separate text file to write the output.*

In [None]:
import re
import string
word_list = []
freq_list = []
freq = {}
in_text = open('d2020.bin', 'r', encoding='utf-8')
in_text_string = in_text.read()
#out_text = open('mesh_output.txt', 'w')
in_text_string = in_text_string.lower()

Create a regular expression to identify individual words in the text file.

In [None]:
word_list = re.findall(r'(\b[a-z]{3,15}\b)', in_text_string)

Iterate through the entire text file. When a match for an individual word is found, create a dictionary with the word as a key and an associated value to serve as a word count. 

In [None]:
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1

Sort the word list alphabetically

In [None]:
freq_list = freq.keys()
sort_list = sorted(freq_list)

Print the formatted list

In [None]:
for i in sort_list:
    print('%-20.20s %8.06d' % (i, freq[i]))

Exit the script

In [None]:
#exit()

**Description adapted from page 17 of "Methods in Medical Informatics".*

## Analysis: Using Printf Formatting Style

Here is a partial output, showing the tail-end of the output file, listing each occurring word and the number of occurrences in the file:* 

<ul>
    <li>zymogen                000013</li>
    <li>zymogenic              000006</li>
    <li>zymogens               000001</li>
    <li>zymolase               000001</li>
    <li>zymomonas              000002</li>
    <li>zymosan                000002</li>
    <li>zymosans               000001</li>
    <li>zyntabac               000001</li>
    <li>zyprexa                000001</li>
    <li>zyrtec                 000001</li>
    <li>zytiga                 000001</li>
    <li>zytram                 000001</li>
    <li>zyvox                  000001</li>
    <li>zyxin                  000003</li>
</ul>

**This section is adapted from section 1.7.2, "Analysis", of page 18 from "Methods in Medical Informatics".*
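
The format string `'%-20.20s %8.06d'` used in the script packs several instructions into a few characters. The sketch below breaks it down and shows an equivalent written with format specifications; the function names `format_row` and `format_row_fstring` are hypothetical and introduced only for illustration.

```python
def format_row(word, count):
    """Format one row with the printf-style codes used in the script."""
    # %-20.20s : left-justify the word in 20 columns, truncating past 20 characters
    # %8.06d   : right-justify the count in 8 columns, zero-padded to 6 digits
    return '%-20.20s %8.06d' % (word, count)


def format_row_fstring(word, count):
    """The same layout written with format specifications instead of % codes."""
    return f"{word:<20.20} {format(count, '06d'):>8}"


print(repr(format_row('zymogen', 13)))
print(format_row('zymogen', 13) == format_row_fstring('zymogen', 13))  # True
```

Because every word occupies exactly 20 columns and every count exactly 8, the rows line up when printed in a fixed-width font.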