# What is a file?

## Definition 1
An ordered collection of bytes.

<img src="images/array.png">

Generally, this collection of bytes has some structure that lets you interpret the data in a useful way, for example ASCII, JPEG, NetCDF, or GeoTIFF.
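
To make the role of structure concrete, here is a minimal sketch that interprets the same four bytes in three different ways. The byte values and variable names are illustrative only and are not part of the original demo.

```python
import struct

raw = bytes([71, 69, 83, 32])  # four arbitrary bytes

# Interpretation 1: ASCII text
as_text = raw.decode("ascii")

# Interpretation 2: one big-endian unsigned 32-bit integer
as_uint32 = int.from_bytes(raw, byteorder="big")

# Interpretation 3: two little-endian unsigned 16-bit integers
as_two_uint16 = struct.unpack("<2H", raw)

print(repr(as_text))
print(as_uint32)
print(as_two_uint16)
```

The bytes never change; only the rule used to read them does. A file format is essentially an agreed-upon set of such rules.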

## Definition 2
An operating system level API. Operating systems provide an API to storage devices.
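
As a rough sketch of what that API looks like from Python, the `os` module exposes thin wrappers around the operating system's file calls. The scratch file name below is hypothetical, and a temporary directory is used so nothing else is touched.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "os_api_demo.bin")

# os.open returns an integer file descriptor, not a Python file object
fd = os.open(demo_path, os.O_RDWR | os.O_CREAT)
try:
    written = os.write(fd, b"GES DISC")   # returns the number of bytes written
    os.lseek(fd, 0, os.SEEK_SET)          # jump back to the first byte
    first_three = os.read(fd, 3)          # read at most 3 bytes
finally:
    os.close(fd)

print(written, first_three)
```

In everyday code you would normally use the built-in `open` function instead, which wraps these low-level calls in a more convenient file object.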

<img src="images/disk.png" width=500>


# Order of bytes example: ASCII encoding
[ASCII](https://en.wikipedia.org/wiki/ASCII) is an old standard that maps the lower 7 bits of an 8-bit byte to a character. Seven bits give only 128 possible values, so the standard can represent at most 128 characters, which is quite limiting.
<br />
<img src="images/asciifull.gif">

**Note:** ASCII is still ubiquitous, and many standards are built on top of it. Files with many common extensions are usually assumed to contain ASCII or ASCII-compatible text: `.txt`, `.xml`, `.json`, `.svg`, `.py`, `.sh`, `.ipynb`, etc.
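
One example of a standard built on ASCII is UTF-8. The sketch below, using only built-in string methods, checks that ASCII bytes stay below 128, that UTF-8 yields identical bytes for pure-ASCII text, and that a character outside the 128-entry table cannot be encoded as ASCII. The variable names are illustrative.

```python
text = "GES DISC"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Every ASCII code fits in 7 bits, so each byte value is below 128
all_seven_bit = all(b < 128 for b in ascii_bytes)

# For pure-ASCII text, UTF-8 produces exactly the same bytes
same_bytes = ascii_bytes == utf8_bytes

# "\u00e9" (e with an acute accent) is not in the ASCII table
try:
    "\u00e9".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(all_seven_bit, same_bytes, encodable)
```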

To start, the only import we need is the `os` module.

In [1]:
import os

# Integer --> ASCII function
Create a simple function that returns a string containing an integer value and the ASCII character associated with it.

Use the `chr` function to convert each ASCII code to a character.

In [2]:
def show_ascii(byte_):
    return "{:03d} => {}".format(byte_,chr(byte_))

Print some examples ...

In [3]:
print(show_ascii(48))
print(show_ascii(64))
print(show_ascii(122))

048 => 0
064 => @
122 => z


# An ASCII-encoded file

## Create a simple file with the characters "GES DISC" inside
"GES_DISC.txt" is ASCII-encoded. The file extension '.txt' is a big hint that the contents are ASCII text.


<img src="images/GES_DISC_array.svg">

Other ASCII-encoded files: '.csv', '.log', '.py', '.sh', etc. Basically, anything you can open and read in a text editor such as vi or Emacs.

In [4]:
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
ges_disc_file_path = os.path.join("data","GES_DISC.txt")

with open(ges_disc_file_path,'w') as f:
    f.write("GES DISC")


## Example 1: reading the file as a string

In [5]:
with open(ges_disc_file_path,'r') as f:
    str_content = f.read()
    
print("Type: {}".format(type(str_content)))
print(f"Contents: '{str_content}'")

Type: <class 'str'>
Contents: 'GES DISC'


## Example 2: reading the file in binary

First, create a function that will print the contents of the file nicely:

In [6]:
def print_file_info(file_path):
    """Print out each byte's value and ASCII character."""
    print(file_path)
    print("--------\n")
    print("File size: {} bytes\n".format(os.path.getsize(file_path)))

    with open(file_path,'rb') as f:
        content = f.read()

    print("Type: {}\n".format(type(content)))
        

    for i, byte_ in enumerate(content):
        print("{}: {}".format(i,show_ascii(byte_)))

Show the contents of 'GES_DISC.txt' in binary:

In [7]:
print_file_info(ges_disc_file_path)

data/GES_DISC.txt
--------

File size: 8 bytes

Type: <class 'bytes'>

0: 071 => G
1: 069 => E
2: 083 => S
3: 032 =>  
4: 068 => D
5: 073 => I
6: 083 => S
7: 067 => C
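
The function above passes `content[i]` straight to `chr`, which works because indexing a `bytes` object yields an integer. A small sketch of that behavior, using a literal instead of the file so it runs on its own:

```python
content = b"GES DISC"

first = content[0]        # indexing gives an int
head = content[0:3]       # slicing gives another bytes object
codes = list(content)     # iterating also gives ints

print(type(first), first)
print(type(head), head)
print(codes)
```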


## Example 3: writing a file as binary

Let's make the letters lower case rather than upper case.

Use the `ord` function to convert characters to ASCII codes.

In [8]:
def str_to_binary(str_):
    """Take a string and convert each character into its byte value."""
    bin_arr = bytearray()
    bin_arr.extend(map(ord,str_))
    return bin_arr
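
For comparison, Python's built-in `str.encode` produces the same bytes for ASCII text. The sketch below repeats the function above so that it runs on its own; treat it as an illustration rather than a replacement for the manual version, which is useful for seeing what happens byte by byte.

```python
def str_to_binary(str_):
    # Same logic as the function above, repeated so this sketch is self-contained
    bin_arr = bytearray()
    bin_arr.extend(map(ord, str_))
    return bin_arr

manual = str_to_binary("ges disc")
builtin = "ges disc".encode("ascii")

# bytearray is mutable and bytes is not, but the byte values are identical
print(bytes(manual) == builtin)
```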

In [9]:
# Write out binary!
with open(ges_disc_file_path,'wb') as f:
    f.write(str_to_binary("ges disc"))

# Now print out the new file contents!
print_file_info(ges_disc_file_path)

data/GES_DISC.txt
--------

File size: 8 bytes

Type: <class 'bytes'>

0: 103 => g
1: 101 => e
2: 115 => s
3: 032 =>  
4: 100 => d
5: 105 => i
6: 115 => s
7: 099 => c
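
Comparing this output with the uppercase listing from Example 2 shows a pattern: each letter's code went up by exactly 32, while the space stayed at 32. A short sketch of that arithmetic, using byte literals rather than the file:

```python
upper = b"GES DISC"
lower = b"ges disc"

# Difference between each pair of byte values
diffs = [lo - up for up, lo in zip(upper, lower)]

# Setting the bit with value 32 on an uppercase letter gives the lowercase letter
g_from_G = chr(ord("G") | 0b00100000)

print(diffs)
print(g_from_G)
```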


## Example 4: more sophisticated binary file access

In the previous examples, we read in the entire file at once. This makes sense for a small file like "GES_DISC.txt". But what if the file is too large to fit in memory? Or what if we simply don't need the whole thing? Or what if we want to change a small amount of data in the file without writing out the whole thing?

Enter arbitrary position-based access. The operating system API (definition 2 of a file!) makes this possible, and you reach it through Python's `open` function. `open` returns a file object whose `seek` and `tell` methods let us move to arbitrary positions in the file and report where we currently are.
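
Before modifying anything, here is a minimal sketch of moving around a file with `seek` and `tell`. It writes to a hypothetical scratch file in a temporary directory so that it does not touch `data/GES_DISC.txt`.

```python
import os
import tempfile

# Hypothetical scratch file, separate from the demo file
demo_path = os.path.join(tempfile.mkdtemp(), "seek_demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"ges disc")

with open(demo_path, "rb") as f:
    f.seek(4)                   # absolute position: 4 bytes from the start
    tail = f.read()             # reads from position 4 to the end
    f.seek(-4, os.SEEK_END)     # position relative to the end of the file
    pos_from_end = f.tell()
    f.seek(0)
    f.seek(2, os.SEEK_CUR)      # position relative to the current position
    third = f.read(1)

print(tail, pos_from_end, third)
```

The second argument to `seek` selects the reference point: the start of the file (the default), the current position, or the end.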

Let's see if we can change the string from 'ges disc' to 'ges-disc' by just changing the space character to a dash.

In [10]:
import sys
with open(ges_disc_file_path,'r+b') as f:
    # when we open the file, we're at the first byte
    print('Start location: {}\n'.format(f.tell()))
    
    while f.tell() < 8:  # the file is 8 bytes long
        pos = f.tell()
        
        sys.stdout.write("Reading one byte ... ")
        curr_char = f.read(1)[0]
        
        print("Read {}, now at position {}.".format(show_ascii(curr_char),f.tell()))
        
        if curr_char == ord(' '):
            print(f"  Space is at position {pos}!!")
            # move back a byte to the position we were in before reading
            f.seek(pos,0)
            print("  Moved back one byte to position {}".format(f.tell()))
            sys.stdout.write("  Writing dash ... ")
            f.write(str_to_binary("-"))
            print("  Finished writing one byte, at position {}.".format(f.tell()))

print("\n\n")
print_file_info(ges_disc_file_path) 

Start location: 0

Reading one byte ... Read 103 => g, now at position 1.
Reading one byte ... Read 101 => e, now at position 2.
Reading one byte ... Read 115 => s, now at position 3.
Reading one byte ... Read 032 =>  , now at position 4.
  Space is at position 3!!
  Moved back one byte to position 3
  Writing dash ...   Finished writing one byte, at position 4.
Reading one byte ... Read 100 => d, now at position 5.
Reading one byte ... Read 105 => i, now at position 6.
Reading one byte ... Read 115 => s, now at position 7.
Reading one byte ... Read 099 => c, now at position 8.



data/GES_DISC.txt
--------

File size: 8 bytes

Type: <class 'bytes'>

0: 103 => g
1: 101 => e
2: 115 => s
3: 045 => -
4: 100 => d
5: 105 => i
6: 115 => s
7: 099 => c
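
The loop above reads one byte at a time to make every step visible. For a file small enough to read at once, a more compact sketch is to locate the space with `bytes.find` and then overwrite just that one byte. The scratch file below is hypothetical and stands in for the demo file.

```python
import os
import tempfile

# Hypothetical scratch file containing the pre-edit contents
demo_path = os.path.join(tempfile.mkdtemp(), "patch_demo.txt")
with open(demo_path, "wb") as f:
    f.write(b"ges disc")

with open(demo_path, "r+b") as f:
    data = f.read()
    space_pos = data.find(b" ")   # -1 if there is no space
    if space_pos != -1:
        f.seek(space_pos)         # jump straight to the space
        f.write(b"-")             # overwrite exactly one byte in place

with open(demo_path, "rb") as f:
    patched = f.read()

print(space_pos, patched, os.path.getsize(demo_path))
```

Only one byte is written and the file size does not change, which is the point of position-based access.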
