Welcome to chapter two of Methods in Medical Informatics! In this section, we will be exploring utility scripts. Utility scripts are small programs that each perform a specific task very efficiently. We will be exploring seven different utility scripts. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Random Numbers

Random numbers are used extensively in Monte Carlo simulations of biological events. The simulations are also used in statistics (e.g., calculating normal distributions), and can even provide simple computational approaches to formal mathematical problems. The script below will generate 10 random numbers between 0 and 1. Afterward, we will explore both the script and the script output in more detail.*

**Description adapted from page 21 of "Methods in Medical Informatics".*

In [None]:
import random
for iterations in range(10):
    print(random.uniform(0,1))
#exit()

## Script Algorithm: Random Numbers

Create an iterator that repeats 10 times. Generate and print a random number between zero and one.*

In [None]:
import random
# Create an iterator that repeats 10 times
for iterations in range(10):
    # Generate and print a random number between zero and one
    print(random.uniform(0,1))

**This section is adapted from section 2.1.1, "Script Algorithm", of page 21 from "Methods in Medical Informatics".*

## Analysis: Random Numbers

Here is a sample output, listing 10 random numbers in the range 0 to 1:

<ul>
<li>0.9067868398909231</li>
<li>0.5852017507830499</li>
<li>0.4349374084781388</li>
<li>0.8124019458322805</li>
<li>0.07051266006231838</li>
<li>0.22051335356334767</li>
<li>0.12961389035176352</li>
<li>0.2825144163889062</li>
<li>0.14088841517263262</li>
<li>0.21106545862352621</li>
</ul>

Had we chosen, we could have rendered an integer output by multiplying each random number by 10 and rounding up or down to the closest integer.*

**This section is adapted from section 2.1.2, "Analysis", of page 22 in "Methods in Medical Informatics".*
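The integer conversion mentioned above can be sketched as follows. This is a minimal illustration, not part of the original script; the fixed seed and the name `scaled` are assumptions added only so that the run is repeatable and easy to inspect.

```python
import random

# Assumption: a fixed seed, used here only to make the run repeatable
random.seed(42)

# Scale each random number from the range 0-1 to the range 0-10,
# then round to the closest integer
scaled = [round(random.uniform(0, 1) * 10) for _ in range(10)]
print(scaled)

# Alternative: draw an integer directly (both endpoints are included)
print(random.randint(0, 10))
```

With the rounding approach, 0 and 10 each cover only half as much of the original range as the other integers, so they turn up about half as often. If every integer should be equally likely, `random.randint` is probably the better choice.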

# Converting Non-ASCII to Base64 ASCII

Almost every computer user has made the mistake of trying to view a non-ASCII file (such as a binary image, or a word-processed file stored in a proprietary format) in a plain-text viewer. Python contains standard modules that will convert any file into Base64. We will be using the base64 module when we start working with image data conveyed in XML files. The script below will convert a binary file into Base64 ASCII. It will then print the Base64-encoded string, followed by the original string recovered by decoding it. Afterward, we will explore both the script and the script output in more detail.*

> This script will utilize the file sample.bin. This is a binary file that contains a single example string. Additional information [here](https://datamine.unc.edu/data-files/)



**Description adapted from page 22 of "Methods in Medical Informatics".*

In [None]:
import base64
sample_file = open('sample.bin', 'rb')
string = sample_file.read()
sample_file.close()
print(base64.encodebytes(string))
print(base64.decodebytes(base64.encodebytes(string)))
#exit()

## Script Algorithm: Converting Non-ASCII to Base64 ASCII

Import the base64 standard module into your script

In [None]:
import base64

Read a sample file into a string variable

In [None]:
sample_file = open('sample.bin', 'rb')
string = sample_file.read()
# Close the file once its contents are in memory
sample_file.close()

Pass the string variable to the Base64 encoding method provided by the module. Print the Base64-encoded string.

In [None]:
print(base64.encodebytes(string))

Pass the base 64 encoded string to the decode method provided by the module. Print the decoded string. 

In [None]:
print(base64.decodebytes(base64.encodebytes(string)))

**This section is adapted from section 2.2.1, "Script Algorithm", of page 23 from "Methods in Medical Informatics".*

## Analysis: Converting Non-ASCII to Base64 ASCII

Here is an example of a string encoded into Base64:

> b'SGVsbG8uLi4udGhpcyBpcyB0aGUgdGV4dA==\n'<br/>
b'Hello....this is the text'

When we use Base64, we produce output files that are larger than the original (binary) files.*

**This section is adapted from section 2.2.2, "Analysis", of page 24 in "Methods in Medical Informatics".*
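To see how much larger the output is, the short sketch below encodes some arbitrary bytes and compares lengths. The data and the names `raw` and `encoded` are made up for illustration; `b64encode` is used instead of `encodebytes` so that the line breaks `encodebytes` inserts do not affect the count.

```python
import base64

# Hypothetical binary data: 768 arbitrary bytes
raw = bytes(range(256)) * 3

# b64encode returns the Base64 text without added line breaks
encoded = base64.b64encode(raw)

print(len(raw), len(encoded))
print(len(encoded) / len(raw))
```

Base64 represents every 3 bytes of input as 4 ASCII characters, so the encoded form is roughly one third larger than the original.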

# Creating a Universally Unique Identifier

The universally unique identifier (UUID) is an algorithm for creating unique strings of uniform format, composed of name and time information, and distributed without a central registration process. A typical UUID may look like this: 
4c108407-0570-4afb-9463-2831bcc6e4a4. The script below will generate a UUID (the uuid4 function it calls builds the identifier from random numbers rather than from name and time information). Afterward, we will explore both the script and the script output in more detail.*

**Description adapted from page 24 of "Methods in Medical Informatics".*

In [None]:
import uuid
print(uuid.uuid4())
#exit()

## Script Algorithm: Creating a Universally Unique Identifier

Import the standard module that creates UUID strings.*

In [None]:
import uuid

Create a new UUID object. Print the UUID string.

In [None]:
print(uuid.uuid4())

**This section is adapted from section 2.3.1, "Script Algorithm", of pages 24-25 from "Methods in Medical Informatics".*

## Analysis: Creating a Universally Unique Identifier

The algorithms for creating UUIDs, and all of the standard versions of the algorithm, are described in a publicly available Request for Comments file: [https://www.ietf.org/rfc/rfc4122.txt](https://www.ietf.org/rfc/rfc4122.txt)*.

**This section is adapted from section 2.3.2, "Analysis", of page 25 in "Methods in Medical Informatics".*
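Python's `uuid` module exposes several of the standard versions. The sketch below is only an illustration of how they differ; the name `'example.org'` is a placeholder, not something taken from the book.

```python
import uuid

# Version 4: built from random numbers (the version used in the script above)
random_id = uuid.uuid4()
print(random_id, random_id.version)

# Version 1: built from the current time and a host identifier
time_id = uuid.uuid1()
print(time_id, time_id.version)

# Version 5: built from a namespace and a name, so the same inputs
# always give the same UUID ('example.org' is a placeholder name)
name_id = uuid.uuid5(uuid.NAMESPACE_DNS, 'example.org')
same_id = uuid.uuid5(uuid.NAMESPACE_DNS, 'example.org')
print(name_id == same_id)
```

A name-based version may suit cases where the same input must always map to the same identifier, while the random version is the usual choice when no such link is wanted.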

# Splitting Text into Sentences

Many text parsing algorithms proceed sentence by sentence, not line by line. This is important in machine translation and natural language exercises that use grammar rules to extract concepts whose parts are scattered through the sentence. It is not uncommon for an information specialist to begin a script by extracting the individual sentences from a narrative text. The script below will split the following string into sentences:

> 'I am here. You are here. We are all here.'

Afterward, we will explore both the script and the script output in more detail.*

**Description adapted from page 25 of "Methods in Medical Informatics".*

In [None]:
import re
all_text = 'I am here. You are here. We are all here.'
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])', all_text)
print('\n'.join(sentence_list))

## Script Algorithm: Splitting Text into Sentences

Start with a variable containing text*

In [None]:
import re
all_text = 'I am here. You are here. We are all here.'

Split the text wherever there is an occurrence of a period (or other sentence delimiter, such as a question mark or exclamation mark), followed by one or more spaces, followed by an uppercase letter. The split returns the resulting sentences as a list

In [None]:
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])', all_text)

Print the sentences in the resulting list, one per line

In [None]:
print('\n'.join(sentence_list))

**This section is adapted from section 2.4.1, "Script Algorithm", of page 26 from "Methods in Medical Informatics".*

## Analysis: Splitting Text into Sentences

The input is:

> 'I am here. You are here. We are all here.'

The output is:

> I am here <br/> You are here <br/> We are all here.

Notice that only the last sentence is terminated by a period. This is because the split operation discards the text matched by the regex pattern, and the pattern includes the sentence-ending punctuation. The period of the last sentence is not followed by one or more spaces and an uppercase letter, so it does not match the pattern and is left in place. The last sentence is included in the output only because it is what remains after the prior match. There are many ways by which we could have corrected for this particular limitation, but sometimes a programmer needs to decide when the performance of a less-than-perfect script is sufficient for his/her intended purposes.*

**This section is adapted from section 2.4.2, "Analysis", of pages 26-27 in "Methods in Medical Informatics".*
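One possible way to keep the punctuation on every sentence is to move the delimiter into a lookbehind assertion, so that only the spaces are consumed by the split. This is a sketch of one alternative, not the book's method.

```python
import re

all_text = 'I am here. You are here. We are all here.'

# (?<=[.!?]) requires a sentence delimiter just before the spaces,
# but does not consume it, so the delimiter stays with its sentence
sentence_list = re.split(r'(?<=[.!?]) +(?=[A-Z])', all_text)
print('\n'.join(sentence_list))
```

Like the original pattern, this one still splits after abbreviations such as "Dr." when they are followed by a capitalized word, so it remains a less-than-perfect script.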

# One-Way Hash on a Name

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value. Thus, this operation is referred to as a "one-way" hash. Examples of public domain one-way hash algorithms are MD5 and Secure Hash Algorithm (SHA). These differ from encryption protocols that produce an output that can be decrypted by a second computation on the encrypted string. 

In theory, one-way hashes can be used to anonymize patient records while still permitting researchers to accrue data over time to a specific patient's record. Names of patients and other identifiers are replaced by their one-way hash values. If a patient returns to the hospital and has an additional procedure performed, the record identifier, when hashed, will produce the same hash value held by the original data set record. The investigator simply adds the data to the "Anonymous" data set record containing the same one-way hash value. Since no identifier in the anonymized data set record can be used to link back to the patient, confidentiality is preserved.

The script below will create a hash value for an inputted name. Afterward, we will explore both the script and the script output in more detail.*

**Description adapted from pages 27-28 of "Methods in Medical Informatics".*

In [None]:
import hashlib
line = input('What is your full name?\n')
line = line.encode('utf-8')
md5_object = hashlib.md5()
md5_object.update(line)
print(md5_object.hexdigest())
#exit()

## Script Algorithm: One-Way Hash on a Name

Import the standard hashlib module, which provides MD5, into your script*

In [None]:
import hashlib

Prompt the user to enter a name

In [None]:
line = input('What is your full name?\n')
line = line.encode('utf-8')

Create an MD5 object and pass the entered name to it

In [None]:
md5_object = hashlib.md5()
md5_object.update(line)

Print the returned one-way hash value

In [None]:
print(md5_object.hexdigest())
#exit()

**This section is adapted from section 2.5.1, "Script Algorithm", of page 28 from "Methods in Medical Informatics".*

## Analysis: One-Way Hash on a Name

There are several available one-way hash algorithms. MD5 is available as a standard module for many different programming languages, but the SHA algorithm is also available. Notice the output is case sensitive. The output for "Joe Smith" is completely different from the hash value of "joe smith". Those who wish to substitute a hash value for a name must be careful to use a consistent format for each name.*

**This section is adapted from section 2.5.2, "Analysis", of page 22 in "Methods in Medical Informatics".*
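One way to enforce a consistent format is to normalize each name before hashing it. The sketch below also shows how entries could accrue under the same hash value, as described earlier. The helper names, the normalization rule (collapse whitespace, lowercase), and the sample entries are all assumptions made for illustration.

```python
import hashlib

def normalize_name(name):
    """Collapse runs of whitespace and convert the name to lowercase."""
    return ' '.join(name.split()).lower()

def name_hash(name):
    """Return the MD5 hex digest of the normalized name."""
    return hashlib.md5(normalize_name(name).encode('utf-8')).hexdigest()

# Hypothetical anonymized data set: one-way hash value -> list of entries
anonymous_records = {}

def add_entry(name, entry):
    """File an entry under the hash of the name, never under the name itself."""
    anonymous_records.setdefault(name_hash(name), []).append(entry)

add_entry('Joe Smith', 'biopsy')
add_entry('joe  smith', 'follow-up visit')   # same person, different typing
add_entry('Ann Jones', 'chest x-ray')

for hash_value, entries in anonymous_records.items():
    print(hash_value, entries)
```

Both spellings of the first name land in the same record because they normalize to the same string. Keep in mind that anyone who can guess a name can recompute its hash, so hashing a bare name is a teaching device rather than a complete de-identification method.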

# One-Way Hash on a File

All values produced by the one-way hash algorithm are fixed-length. The one-way hash value for a 10 megabyte file will have the same length as a one-way hash value for a patient's name. A change of a single character in a file will result in a completely different one-way hash value for the file.

By sending a one-way hash value for a file, along with the file itself, you can, with a high degree of confidence, authenticate your file. When others receive your file, along with the MD5 hash that you created, they can recompute the MD5 hash on the file and compare the output with the MD5 hash that you sent. If the two hash values are identical, then they can be fairly certain that the file was not altered from the original file (for which the original MD5 value was computed).
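The comparison step described above might look like the following sketch. The function names `file_md5` and `matches_digest` and the chunk size are assumptions made for illustration. The file is read in chunks so that a large file does not have to fit in memory.

```python
import hashlib
import hmac

def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest (lowercase) of the file at path."""
    md5_object = hashlib.md5()
    with open(path, 'rb') as handle:
        # Read the file a chunk at a time until read() returns b''
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            md5_object.update(chunk)
    return md5_object.hexdigest()

def matches_digest(path, expected_hex):
    """Recompute the file's MD5 and compare it with the digest received."""
    return hmac.compare_digest(file_md5(path), expected_hex.lower())
```

With the image file from this section in the working directory, a call such as `matches_digest('US.GIF', '39842F5ED1516D7C541155FD2B093B36')` would be expected to return `True` as long as the file is unchanged.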

> This script will utilize the file [US.GIF](http://datamine.unc.edu/jupyter/view/Methods-in-Medical-Informatics-master/US.GIF). This is an image file which contains an image of the United States. Additional information [here](https://datamine.unc.edu/data-files/)

The script below will create a hash value for a specified file. Afterward, we will explore both the script and the script output in more detail.*

**Description adapted from page 30 of "Methods in Medical Informatics".*

In [None]:
import hashlib
md5_object = hashlib.md5()
# Read the whole file as bytes
sample_file = open('US.GIF', 'rb')
file_bytes = sample_file.read()
sample_file.close()
md5_object.update(file_bytes)
md5_string = md5_object.digest()
# Format each byte of the digest as two uppercase hex characters
print(''.join(['%02X' % x for x in md5_string]))
#exit()

## Script Algorithm: One-Way Hash on a File

Import an external standard module that computes the MD5 one-way hash.*

In [None]:
import hashlib

Read the contents of the file into a variable. In this case, we use the file, "US.GIF".

In [None]:
sample_file = open('US.GIF', 'rb')
file_bytes = sample_file.read()
sample_file.close()

Call the module function to create an MD5 digest value on the contents of the file.

In [None]:
md5_object = hashlib.md5()
md5_object.update(file_bytes)
md5_string = md5_object.digest()

Print out the digest value in hex format

In [None]:
print(''.join([ '%02X' % x for x in md5_string]).strip())
#exit()

**This section is adapted from section 2.6.1, "Script Algorithm", of page 30 from "Methods in Medical Informatics".*

## Analysis: One-Way Hash on a File

The script output is:

> 39842F5ED1516D7C541155FD2B093B36

The alphanumeric sequence is the MD5 message digest of the US.GIF image file. Changing a single byte in the original file and repeating the MD5 digest operation will yield an entirely different digest value.*

**This section is adapted from section 2.6.2, "Analysis", of page 31 in "Methods in Medical Informatics".*

# Prime Number Generator

A prime number is an integer greater than 1 that cannot be written as the product of two smaller positive integers. If a number is prime, then no smaller number (other than 1) will divide into it without producing a remainder. To determine whether a number is prime, we can test each smaller number, starting with 2, to see if it divides into the number without leaving a remainder. If none does, then the number is a prime.

We can use a little trick to shorten the process, by stopping the iterations when we have examined every smaller number, in ascending order, up to and including the square root of the number. If there were an integer larger than the square root of the number that could be multiplied by another integer to give the number, then the other integer would need to be smaller than the square root of the number (otherwise, the two integers would produce a product larger than the number). But we have already tested all of the integers up to the square root of the number, and they all yielded a nonzero remainder. So we do not need to test the integers greater than the square root of the number.
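The square root shortcut can be written as a small test function. This is a minimal sketch rather than the book's script; the name `is_prime` is an assumption, and `math.isqrt` requires Python 3.8 or later.

```python
import math

def is_prime(n):
    """Return True if n is prime, testing divisors up to its square root."""
    if n < 2:
        return False
    # math.isqrt gives the integer square root; the +1 makes the range
    # include the square root itself, so perfect squares are caught
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True

print([n for n in range(2, 30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Including the square root in the range matters: without it, a perfect square such as 9 or 25 would have no divisor tested that divides it evenly and would be reported as prime.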

Here is how you can generate a very long list of prime numbers with just a few lines of code. Afterward, we will explore both the script and the script output in more detail.

**Description adapted from pages 31-32 of "Methods in Medical Informatics".*

In [None]:
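import math
print('2\n3')
for i in range(4, 10000):
    # Only divisors up to the integer square root need to be tested
    upper = int(math.sqrt(i))
    # Assume i is prime until a divisor is found
    state = 1
    # The +1 makes the range include the square root itself
    for thing in range(2, upper + 1):
        if (i % thing == 0):
            state = 0
            break
    if (state == 1):
        print(i)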
#exit()

## Script Algorithm: A Prime Number Generator

Create a loop for all of the integers up to an arbitrary maximum (10,000 in this case). If the integer is prime, then there will be no smaller integer, other than 1, that will divide into the number with a remainder of 0.*

In [None]:
import math
# Create loop up to a maximum
for i in range(4, 10000):
    upper = int(math.sqrt(i))
    # Assume i is prime until a divisor is found
    state = 1
    # Loop through to examine for any integer that can divide into the larger
    # number with no remainder (the +1 includes the square root itself)
    for thing in range(2, upper + 1):
        if (i % thing == 0):
            state = 0
            break
    if (state == 1):
        print(i)

**This section is adapted from section 2.7.1, "Script Algorithm", of page 32 from "Methods in Medical Informatics".*

## Analysis: A Prime Number Generator

Every biomedical scientist who uses medical records and other confidential data can benefit by understanding the role of prime numbers. Many widely used cryptographic methods rely on large prime numbers, which, when multiplied together, produce a number that cannot be factored by a quick computation. Here is the partial output of our method for producing prime numbers:*

<ul>
<li>9887</li>
<li>9901</li>
<li>9907</li>
<li>9923</li>
<li>9929</li>
<li>9931</li>
<li>9941</li>
<li>9949</li>
<li>9967</li>
<li>9973</li>
</ul>

**This section is adapted from section 2.7.2, "Analysis", of page 34 in "Methods in Medical Informatics".*