## File types: Text vs Binary

- You probably know many file types such as Images (png, jpg, ...), Word, Excel, mp3, mp4, csv, and now also .py files.
- Internally there are two big categories. **Text and Binary files**.
> - Text files are the ones that look readable if you open them with a plain text editor such as Notepad.
> - Binary files will look like a mess if you open them in Notepad.
- For Binary files you need a special application to "look" at their content. For example:
  - The Excel and Word programs for the appropriate files.
  - Some image viewer application to view images.
  - VLC to look at an mp4, or some audio player to hear the content of mp3 files.

- **Text**: Makes sense when opened with Notepad: .txt, .csv, .py, .pl, ..., HTML, XML, YAML, JSON
- **Binary**: Needs a specialized tool to make sense of it: images, Zip files, Word, Excel, .exe, mp3, mp4
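
The difference between the two categories can be seen from Python itself. The following is a minimal sketch, assuming a throwaway file called `demo.txt` (a hypothetical name) created in a temporary directory: the same file is read once in text mode and once in binary mode.

```python
import os
import tempfile

# Hypothetical demo file created in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")

# newline="" keeps "\n" as-is on every platform
with open(path, "w", encoding="utf-8", newline="") as fh:
    fh.write("caf\u00e9\n")

# Text mode: the bytes are decoded and we get a str
with open(path, "r", encoding="utf-8") as fh:
    as_text = fh.read()

# Binary mode: we get the raw bytes exactly as stored on disk
with open(path, "rb") as fh:
    as_bytes = fh.read()

print(type(as_text), len(as_text))     # 5 characters
print(type(as_bytes), len(as_bytes))   # 6 bytes, the accented letter takes 2 bytes in UTF-8
```

A text file is still stored as bytes; text mode only adds the decoding step for us.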


> 1. In Python you have specialized modules for each well-known binary type to handle files of that format.
>
> 2. Text files, on the other hand, can be handled by low-level file-reading functions. Even for those we usually have modules that know how to read and interpret the specific formats (e.g. CSV, HTML, XML, YAML, JSON parsers).
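
As a small illustration of such format-aware modules, here is a sketch using the `json` and `csv` modules from the standard library. The sample strings are made up for this example; in real code the text would come from a file.

```python
import csv
import io
import json

# Hypothetical sample content; normally this would be read from a file
json_text = '{"name": "John", "age": 25}'
csv_text = "name,age\nJohn,25\nEmily,30\n"

# The JSON parser returns a dict with properly typed values
person = json.loads(json_text)

# The CSV parser returns one dict per row; all values are strings
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(person["age"] + 1)
print(rows[1]["name"])
```

Splitting the lines by hand would work for this tiny sample, but the parsers also deal with quoting, escaping and nesting.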


## Open vs. Read vs. Load

- The expression "open a file" has two distinct meanings for programmers and users of software. 
- For a user of Word, for example, "open the file" would mean to be able to see its content in a formatted way inside the editor.
- When a programmer - now acting as a regular user - opens a Python file in an editor such as Notepad++ or PyCharm, the expectation is to see the content of that program with nice colors (syntax highlighting).

- However in order to provide this the programmer behind these applications had to do several things.

> 1. Connect to a file on the disk (aka. "opening the file" in programmer speak).
>
> 2. Read the content of the file from the disk to memory.
>
> 3. Format the content read from the file as expected by the user of that application.
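
The steps above can be sketched in a few lines. This is only an illustration: `show_file` and the demo file are hypothetical, and the "formatting" is nothing more than adding line numbers.

```python
import os
import tempfile

def show_file(path):
    """Open, read and format a text file the way a very simple viewer might."""
    # Open: connect to the file on the disk
    with open(path, "r", encoding="utf-8") as fh:
        # Read: copy the content from the disk into memory
        content = fh.read()
    # Format: prepare the content for the user (here: numbered lines)
    lines = content.splitlines()
    return [f"{number:>3}: {line}" for number, line in enumerate(lines, start=1)]

# Hypothetical demo file
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.py")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write("import sys\nprint(sys.argv)\n")

formatted = show_file(demo_path)
for row in formatted:
    print(row)
```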

## Binary files: Images
- This is just a quick example of how to use the Pillow module to handle images. There is a whole chapter on dealing with images.


In [10]:
from PIL import Image

# Open an image file
image = Image.open('__FILES/image.jpg')

# Display basic information about the image
print('Image format:', image.format)
print('Image size:', image.size)
print('Image mode:', image.mode)

# Resize the image
new_size = (800, 600)
resized_image = image.resize(new_size)
resized_image.save('__FILES/resized_image.jpg')

# Convert the image to grayscale
grayscale_image = image.convert('L')
grayscale_image.save('__FILES/grayscale_image.jpg')

# Crop a region of the image
box = (100, 100, 500, 400)  # (left, upper, right, lower)
cropped_image = image.crop(box)
cropped_image.save('__FILES/cropped_image.jpg')

# Rotate the image
angle = 45
rotated_image = image.rotate(angle)
rotated_image.save('__FILES/rotated_image.jpg')

# Flip the image horizontally
flipped_horizontally = image.transpose(Image.FLIP_LEFT_RIGHT)
flipped_horizontally.save('__FILES/flipped_horizontally.jpg')

# Flip the image vertically (saved under its own name so it does not overwrite the previous file)
flipped_vertically = image.transpose(Image.FLIP_TOP_BOTTOM)
flipped_vertically.save('__FILES/flipped_vertically.jpg')

# Close the image file
image.close()

Image format: JPEG
Image size: (194, 260)
Image mode: RGB


## Reading an Excel file

- There are many ways to deal with Excel files as well.


In [13]:
from openpyxl import Workbook
from openpyxl import load_workbook

# Create a new workbook
workbook = Workbook()
sheet = workbook.active

# Write data to the cells
sheet['A1'] = 'Name'
sheet['B1'] = 'Age'
sheet['C1'] = 'Country'

data = [
    ('John', 25, 'USA'),
    ('Emily', 30, 'Canada'),
    ('Michael', 35, 'UK'),
]

for row in data:
    sheet.append(row)

# Save the workbook
workbook.save('__FILES/sample.xlsx')

# Load an existing workbook
loaded_workbook = load_workbook('__FILES/sample.xlsx')
loaded_sheet = loaded_workbook.active

# Read data from the cells
for row in loaded_sheet.iter_rows(min_row=1, values_only=True):
    print(row)

# Modify the data in the cells
loaded_sheet['B2'] = 26
loaded_workbook.save('__FILES/sample.xlsx')

# Close the workbook
loaded_workbook.close()


('Name', 'Age', 'Country')
('John', 25, 'USA')
('Emily', 30, 'Canada')
('Michael', 35, 'UK')


## Reading a YAML file

In [14]:
import yaml

# Write data to a YAML file
data = {
    'name': 'John',
    'age': 25,
    'country': 'USA'
}

with open('__FILES/data.yaml', 'w') as file:
    yaml.dump(data, file)

# Read data from a YAML file
with open('__FILES/data.yaml', 'r') as file:
    loaded_data = yaml.safe_load(file)

print(loaded_data)

# Modify the data
loaded_data['age'] = 26

# Write the modified data back to the YAML file
with open('__FILES/data.yaml', 'w') as file:
    yaml.dump(loaded_data, file)


{'age': 25, 'country': 'USA', 'name': 'John'}


## Read and analyze a text file

In [16]:
filename = '__FILES/report.txt'

total = 0
with open(filename, "r") as fh:
    for row in fh:
        # Only lines that contain "Report:" carry a number
        if "Report:" not in row:
            continue
        # Split on the first colon only; the number is the part after it
        text, value = row.split(":", 1)
        total += float(value.strip())

print(total)

204.0


## Open and read file (easy but not recommended)

- In some code you will encounter the following way of opening files.
- This was used before "with" was added to the language.
- It is not a recommended way of opening a file, as you might easily forget to call "close" and that might cause trouble. For example, you might lose data. **Don't do that.**


In [25]:

filename = '__FILES/report.txt'

fh = open(filename, 'r')
for i, line in enumerate(fh):
    if i<3:
        print(line)
fh.close()

This is a text report there are some lines that start with

Report: 23

Other linese has this somewhere in the middle.



## Open and read file using with (recommended)

In [22]:
filename = '__FILES/report.txt'

with open(filename, 'r') as fh:   # open(filename) would be enough
    count = 0
    for line in fh:
        if count < 3:  # print first 3 lines
            count +=1
            print(line)               # duplicate newlines

# close is called when we leave the 'with' context

This is a text report there are some lines that start with

Report: 23

Other linese has this somewhere in the middle.
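
To see what `with` saves us from, here is a rough sketch of the manual equivalent. It assumes a small demo file created in a temporary directory (the file name is hypothetical). Without `with`, the call to `close` has to sit in a `finally` block to be guaranteed.

```python
import os
import tempfile

# Hypothetical demo file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w") as fh:
    fh.write("first\nsecond\n")

# Manual version: close() runs even if reading raises an exception
fh_manual = open(path)
try:
    first_manual = fh_manual.readline()
finally:
    fh_manual.close()

# The same effect using a context manager
with open(path) as fh_with:
    first_with = fh_with.readline()

print(fh_manual.closed, fh_with.closed)
```

Both file handles end up closed; the `with` version is shorter and harder to get wrong.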



## Read file remove newlines

In [24]:
filename = '__FILES/report.txt'

with open(filename, 'r') as fh:
    for i, line in enumerate(fh):
        if i > 3:
            break
        line = line.rstrip("\n")
        print(line)

This is a text report there are some lines that start with
Report: 23
Other linese has this somewhere in the middle.



## Filename on the command line

In [None]:
import sys

def main():
    if len(sys.argv) != 2:
        sys.exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    with open(filename) as fh:
        print("Working on the file", filename)

main()

## Filehandle with return

In [26]:
import sys

def process_file(filename):
    with open(filename, 'r') as fh:

        for line in fh:
            line = line.rstrip("\n")
            if len(line) > 0 and line[0] == '#':
                return

            if len(line) > 1 and line[0:2] == '//':
                return

            # process the line
            print(line)


process_file('__FILES/report.txt')

This is a text report there are some lines that start with
Report: 23
Other linese has this somewhere in the middle.

Begin report

Report: -3

Like this. Report: 17
More lines starting with
Report: 44

End report

We will have some exercise with this file. Maybe 4 exercises.
Report: 123


## Read all the lines into a list

In [27]:
filename = '__FILES/test.txt'

with open(filename, 'r') as fh:
    lines = fh.readlines()   # reads all the lines into a list

print(f"number of lines: {len(lines)}")

for line in lines[:3]:
    print("## " + line, end="")

print('------')

lines.reverse()
for line in lines[-3:]:
    print(line, end="")

number of lines: 37
## File types: Text vs Binary
## Open vs. Read vs. Load
## Binary files: Images
------
Binary files: Images
Open vs. Read vs. Load
File types: Text vs Binary


## Read all the characters into a string (slurp)
- In some other cases, especially if you are looking for a pattern that starts on one line but ends on another, you are better off having the whole file as a single string in a variable. This is where the read method comes in handy.
- It can also be used to read the file in chunks.

In [30]:
filename = '__FILES/test.txt'

with open(filename, 'r') as fh:
    content = fh.read()   # reads the whole file into a single string

print(type(content))
print(len(content))   # number of characters in file


<class 'str'>
983
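
Both uses of `read` can be sketched as follows. The demo file and the `Begin`/`End` markers are made up for this illustration: first a pattern spanning several lines is matched in the slurped content, then the same file is read 10 characters at a time.

```python
import os
import re
import tempfile

# Hypothetical demo file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w") as fh:
    fh.write("intro\nBegin\n42\nEnd\noutro\n")

# Slurp the file and search across line boundaries
with open(path) as fh:
    content = fh.read()
match = re.search(r"Begin\n(.*?)\nEnd", content, flags=re.DOTALL)
inside = match.group(1) if match else None

# Read the same file in fixed-size chunks
chunks = []
with open(path) as fh:
    while True:
        chunk = fh.read(10)   # at most 10 characters
        if chunk == "":       # an empty string means end of file
            break
        chunks.append(chunk)

print(inside)
print(len(chunks))
```

Reading in chunks keeps memory use bounded for large files, but a pattern might then be cut in two at a chunk boundary.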


## Not existing file

In [31]:
filename = '__FILES/unicorns.txt'

with open(filename, 'r') as fh:
    lines  = fh.read()
print("still running")

FileNotFoundError: [Errno 2] No such file or directory: '__FILES/unicorns.txt'

## Open file exception handling

In [33]:
filename = '__FILES/unicorns.txt'

try:
    with open(filename, 'r') as fh:
        lines = fh.read()
except Exception as err:
    print('There was some error in the file operations.')
    print(type(err).__name__)
    print(err)

print('Still running.')

There was some error in the file operations.
FileNotFoundError
[Errno 2] No such file or directory: '__FILES/unicorns.txt'
Still running.
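
Catching `Exception` hides the difference between the various problems. A possible refinement, shown here only as a sketch, is to catch the specific exception classes first. The helper name `read_text` and the path are hypothetical.

```python
import os
import tempfile

def read_text(path):
    """Return the content of a file, or None if it cannot be read."""
    try:
        with open(path, "r") as fh:
            return fh.read()
    except FileNotFoundError:
        print(f"'{path}' does not exist.")
    except PermissionError:
        print(f"No permission to read '{path}'.")
    except OSError as err:
        # Any other operating-system level problem
        print(f"Other I/O problem with '{path}': {err}")
    return None

# A path inside a fresh temporary directory, so it certainly does not exist
missing = os.path.join(tempfile.mkdtemp(), "unicorns.txt")
result = read_text(missing)
print(result)
```

`FileNotFoundError` and `PermissionError` are both subclasses of `OSError`, so they have to be listed before it.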


## Open many files - exception handling

In [38]:
import sys

def main(args):
    for filename in args:
        try:
            do_some_stuff('__FILES/' + filename)
        except Exception as err:
            print(f"trouble with '{filename}': Error: {err}")

def do_some_stuff(filename):
    with open(filename) as fh:
        total = 0
        count = 0
        for line in fh:
            number = float(line)
            total += number
            count += 1
        print(filename, "Average: ", total/count)

main(['numbers.txt', 'numbers2.txt'])

__FILES/numbers.txt Average:  58.25
__FILES/numbers2.txt Average:  39.333333333333336


## Writing to file

- To write to a file we open it with the "w" (write) mode. If the file does not exist, Python will try to create it. If the file already exists, opening it this way removes all its content, so if we don't write anything we end up with an empty file.
- Once the file is open we can use the write method to write to it. It does NOT automatically append a newline, so we have to include \n ourselves wherever we want one.

- Opening the file will fail if we don't have write permission or if the folder in which we are trying to create the file does not exist.



In [39]:

filename = '__FILES/data.txt'

with open(filename, 'w') as out:
    out.write('text\n')
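
The three points above can be checked with a short experiment. This is a sketch that works in a temporary directory; the file and folder names are hypothetical.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")

with open(path, "w") as out:
    out.write("first")       # no newline is added for us
    out.write("second\n")

with open(path) as fh:
    written = fh.read()

# Opening again with "w" empties the file, even if we write nothing
with open(path, "w") as out:
    pass
size_after_reopen = os.path.getsize(path)

# Opening a file for writing in a folder that does not exist fails
bad_path = os.path.join(tmp_dir, "no_such_folder", "data.txt")
try:
    with open(bad_path, "w") as out:
        out.write("text\n")
    failed = False
except FileNotFoundError:
    failed = True

print(repr(written), size_after_reopen, failed)
```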

## Print to file
- We can also use the print function to print (or write) to a file. In this case the same rules apply as printing to standard output (automatically adding a trailing newline, inserting a space between parameters).
- We do this by passing the file-handle as the value of the file parameter of print.


In [40]:

filename = '__FILES/data.txt'
with open(filename, 'w') as fh:
    print("Hello", "World", file=fh)
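
Since the usual rules of `print` apply, its `sep` and `end` parameters work for files as well. A small sketch, writing to a hypothetical file in a temporary directory and reading it back:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")

with open(path, "w") as fh:
    print("Hello", "World", file=fh)          # space between values, newline at the end
    print("a", "b", "c", sep=",", file=fh)    # custom separator
    print("no newline", end="", file=fh)      # suppress the trailing newline

with open(path) as fh:
    printed = fh.read()

print(repr(printed))
```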

## Append to file

In [41]:
filename = '__FILES/data.txt'
with open(filename, 'a') as out:
    out.write('append more text\n')

## Binary mode

In [42]:
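filename = '__FILES/binary.txt'

try:
    with open(filename, 'rb') as fh:
        while True:
            chunk = fh.read(1000)   # read up to 1000 bytes; returns a bytes object
            if len(chunk) == 0:     # an empty bytes object means end of file
                break
            print(len(chunk))
            # do something with the content of chunk
except OSError as err:
    # Report the problem instead of silently ignoring it
    print(f"Could not read '{filename}': {err}")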

## Does file exist? Is it a file?

In [45]:
import os

# Example file path
file_path = '__FILES/data.txt'
directory_path = '__FILES/temp/'

# Check if the path exists
if os.path.exists(file_path):
    print(f"The file {file_path} exists.")
else:
    print(f"The file {file_path} does not exist.")

# Check if the path refers to a file
if os.path.isfile(file_path):
    print(f"The path {file_path} is a file.")
else:
    print(f"The path {file_path} is not a file.")

# Check if the path refers to a directory
if os.path.isdir(directory_path):
    print(f"The path {directory_path} is a directory.")
else:
    print(f"The path {directory_path} is not a directory.")


The file __FILES/data.txt exists.
The path __FILES/data.txt is a file.
The path __FILES/temp/ is not a directory.


## Direct access of a line in a file

In [49]:
filename = '__FILES/report.txt'
with open(filename, 'r') as fh:
    print(fh[2])

TypeError: '_io.TextIOWrapper' object is not subscriptable

In [52]:
line = 2
with open(filename, 'r') as fh:
    rows = fh.readlines()
print(f"line {line} : {rows[line]}")

line 2 : Other linese has this somewhere in the middle.
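
Reading all the lines just to get one of them is wasteful for big files. One possible alternative, sketched here under the assumption that we only need a single line, is to stop reading as soon as that line is reached using `itertools.islice`. The helper `get_line` and the demo file are hypothetical.

```python
import itertools
import os
import tempfile

def get_line(path, index):
    """Return the line at 0-based position index, or None if the file is shorter."""
    with open(path) as fh:
        # islice skips the first index lines and yields at most one line
        for line in itertools.islice(fh, index, index + 1):
            return line.rstrip("\n")
    return None

# Hypothetical demo file
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "w") as fh:
    fh.write("first\nsecond\nthird\n")

third = get_line(demo_path, 2)
beyond = get_line(demo_path, 10)
print(third, beyond)
```

The file is still read from the beginning, because lines have different lengths and Python cannot know where a given line starts without reading the ones before it.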



## Exercise: count digits


- Given the file __FILES/ex1.txt (or a similar file), write a script that counts how many times each digit appears. The output will look like this, just with different values.
 
0 0 <br/>
1 3 <br/>
2 3 <br/>
3 4 <br/>
4 2 <br/>
5 2 <br/>
6 1 <br/>
7 2 <br/>
8 1 <br/>
9 1 <br/>


In [None]:
import sys

if len(sys.argv) < 2:
    sys.exit("Need name of file.")

counter = [0] * 10   # counter[d] holds how many times digit d was seen
filename = sys.argv[1]
with open(filename) as fh:
    for line in fh:
        for c in line:
            # Skip spaces, newlines, letters and anything else that is not a digit
            if c not in "0123456789":
                continue
            counter[int(c)] += 1

for i in range(10):
    print("{} {}".format(i, counter[i]))

## Exercise: remove newlines
- write a script that will be able to read all the lines of a given file into a list and remove trailing newlines.

In [None]:
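import sys

if len(sys.argv) != 2:
    sys.exit(f"Usage: {sys.argv[0]} FILENAME")

filename = sys.argv[1]   # sys.argv[0] is the script itself
with open(filename) as fh:
    lines = []
    for line in fh:
        lines.append(line.rstrip("\n"))
print(lines)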

## Exercise: print lines with Report:
- In many cases you get a report as free-form text (and not in a CSV file or an Excel file). You need to extract the information from such a file after recognizing the patterns. This exercise tries to provide such a case.

- Given the file `__FILES/report.txt`

- Print out the first line that starts with Report:.
- Print out all the lines that have the string Report: in them.
- Print out all the lines that start with the string Report:.
- Print out the numbers that are after Report:. (e.g. for Report: 42 print out 42)
- Add the numbers that come after the string Report:. For the file above the result is expected to be 204.
- Do the same, but only take into account the lines between the Begin report and End report lines. (sum expected to be 58)



In [None]:
import sys


def main():
    if len(sys.argv) != 2:
        sys.exit(f"Usage: {sys.argv[0]} FILENAME")
        # text_report.txt

    in_file = sys.argv[1]
    show_rows_with_report(in_file)
    show_rows_start_with_report(in_file)
    show_numbers_after_report(in_file)
    sum_numbers_after_report(in_file)
    sum_numbers_after_report_within_begin_end_section(in_file)


def show_rows_with_report(in_file):
    with open(in_file) as fh:
        for row in fh:
            row = row.rstrip("\n")
            if 'Report:' in row:
                print(row)
    print('-' * 20)

def show_rows_start_with_report(in_file):
    with open(in_file) as fh:
        for row in fh:
            row = row.rstrip("\n")
            if row.startswith('Report:'):
                print(row)
    print('-' * 20)

def show_numbers_after_report(in_file):
    with open(in_file) as fh:
        for row in fh:
            row = row.rstrip("\n")
            if 'Report:' in row:
                parts = row.split(':')
                print(int(parts[1]))
    print('-' * 20)

def sum_numbers_after_report(in_file):
    total = 0
    with open(in_file) as fh:
        for row in fh:
            row = row.rstrip("\n")
            if 'Report:' in row:
                parts = row.split(':')
                total += int(parts[1])
    print(f"Total: {total}")
    print('-' * 20)

def sum_numbers_after_report_within_begin_end_section(in_file):
    in_section = False
    total = 0
    with open(in_file) as fh:
        for row in fh:
            row = row.rstrip("\n")
            if row == 'Begin report':
                in_section = True
                continue
            if row == 'End report':
                in_section = False
                continue
            if in_section:
                if 'Report:' in row:
                    parts = row.split(':')
                    total += int(parts[1])
                    print(int(parts[1]))
    print(f"Total in section: {total}")
    print('-' * 20)


main()

## Exercise: color selector

- Create a file similar to the __FILES/colors.txt file and use it as the list of colors in the earlier example where we prompted for a color.




In [None]:
def main():
    try:
        with open('colors.txt') as fh:
            colors = []
            for line in fh:
                colors.append(line.rstrip("\n"))
    except OSError:   # IOError is an alias of OSError in Python 3
        print("Could not open colors.txt")
        return

    # Show the colors with their index so the user can pick one
    for i in range(len(colors)):
        print("{}) {}".format(i, colors[i]))

    c = int(input("Select color: "))
    print(colors[c])

main()

## Exercise: Combine lists

- file_a
  - `Tomato=78`
  - `Avocado=23`
  - `Pumpkin=100`

- file_b
  - `Cucumber=17`
  - `Avocado=10`
  - `Cucumber=10`

Write a script that takes the two files and combines them, adding the values for each vegetable. The expected result is:

- file_c
  - `Avocado=33`
  - `Cucumber=27`
  - `Pumpkin=100`
  - `Tomato=78`




In [None]:
files = ['examples/files/a.txt', 'examples/files/b.txt']
names = []
values = []

for filename in files:
    with open(filename) as fh:
        for line in fh:
            name, value = line.rstrip("\n").split("=")
            value = int(value)
            if name in names:
                idx = names.index(name)
                values[idx] += value
            else:
                names.append( name )
                values.append( value )

# Write the result sorted by name, as in the expected output
with open('out.txt', 'w') as fh:
    for name, value in sorted(zip(names, values)):
        fh.write("{}={}\n".format(name, value))

## Exercise: Number guessing game - save to file

## Filehandle using with and not using it

In [None]:
filename = 'examples/files/numbers.txt'

fh = open(filename, 'r')
print(fh.closed)   # False
data = fh.read()
# do something with the data
fh.close()
print(fh.closed)   # True



with open(filename, 'r') as fh:
    print(fh.closed)   # False
    data = fh.read()
print(fh.closed)       # True, the file was closed when we left the 'with' context