# Computer Programming and Algorithms

## Week 8.2: Reading Files

<img src="https://github.com/engmaths/EMAT10007_2023/blob/main/weekly_content/img/full-colour-logo-UoB.png?raw=true" width="20%">

# Aims

In this video we will:

* Open and close a file using a computer program
* Convert the data to a more useable form within the computer program
* Read different types of file

In the following examples we will use:
* the __local path__, not the __global path__
* file path notation for Mac/Linux systems (forward slash `/`), so remember to change this to backslash `\` if you are using Windows
* files located in the same directory as the computer program (or in a directory downstream of it)
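
One way to avoid changing slashes by hand is the `pathlib` module from the Python standard library, which joins the parts of a path using the correct separator for the system the program runs on. The sketch below is only an illustration: the temporary folder and the file contents are made up so that the example can run anywhere.

```python
import tempfile
from pathlib import Path

# The folder and file below are created only for this illustration
with tempfile.TemporaryDirectory() as folder:
    example_dir = Path(folder) / 'Example_1'
    example_dir.mkdir()

    # '/' joins the parts of a path using the right separator for the system
    readme_path = example_dir / 'README.txt'
    readme_path.write_text('SEMT10002\n', encoding='utf-8')

    # A Path object can be passed to open() in place of a string
    with open(readme_path, encoding='utf-8') as file:
        contents = file.read()

print(contents)
```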

If you are opening the notebook in Google Colab, you can import the data files to use with the code cells throughout the notebook by running the code cell below:

In [2]:
!wget https://raw.githubusercontent.com/engmaths/SEMT10002_2024/main/weekly_labs/Week_08_Reading_Data/README.txt
!wget https://raw.githubusercontent.com/engmaths/SEMT10002_2024/main/weekly_labs/Week_08_Reading_Data/temperature.csv
!wget https://raw.githubusercontent.com/engmaths/SEMT10002_2024/main/weekly_labs/Week_08_Reading_Data/snake.png
!wget https://raw.githubusercontent.com/engmaths/SEMT10002_2024/main/weekly_labs/Week_08_Reading_Data/Document.docx


# Opening and closing a file using a computer program

Consider the file system below

```text
Week_6/
|
|--- Example_1/
        |
        |--- program_1.py
        |--- README.txt
```


We can create a *file object* in program_1.py using:

```python
file = open('README.txt')
```

__Object__: A piece of data that has its own attributes and behaviour (e.g. an int, string or list)

Just like other objects, you can give the file object a variable name of your choosing

```python
reader = open('README.txt')
```

```python
my_data = open('README.txt')
```

Some methods that belong to the file object type:

`read`: reads the contents of the file

`close`: closes the file <br>(Before the program exits, the file must be closed)

In [83]:
file = open('README.txt')

print(file.read())

file.close()

Computer programming and algorithms
SEMT10002



Another way to create a file object:

In [86]:
with open('README.txt') as file:
    print(file.read())

Computer programming and algorithms
SEMT10002



Notice that the second line is indented with respect to the first. 

The `with` statement closes the file at the end of the indented block of code.

There is no need to use `close`.

This avoids the situation where the file is left open if:
- you forget to include `close`
- the program terminates due to an error before `close` is executed

We will use the `with open()` structure in all following examples 
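
To check that `with` really does close the file when an error occurs, we can look at the `closed` attribute of the file object. This is a minimal sketch: the file `demo.txt` and the deliberately raised error are made up for the illustration, and the file is first created in write mode (`'w'`) so that the example is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a small throwaway file (hypothetical name) to read from
    path = os.path.join(folder, 'demo.txt')
    with open(path, 'w') as file:
        file.write('Computer programming and algorithms\n')

    try:
        with open(path) as file:
            first_line = file.readline()
            # Simulate the program failing before the end of the block
            raise RuntimeError('something went wrong')
    except RuntimeError:
        # The with statement has already closed the file
        closed_after_error = file.closed

print(closed_after_error)  # True
```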

# Reading a file using a computer program

Like all objects, the file object type has a set of specific properties and behaviours.

File objects are *iterable* (we can iterate through each item in the object)

Each item is a new line of the file

In [91]:
with open('README.txt') as file:
    for value in file:
        print('Line:', value)

Line: Computer programming and algorithms

Line: SEMT10002



File objects are not *subscriptable* (we can't access an individual element using an index)

In [94]:
with open('README.txt') as file:
    print(file[0])

TypeError: '_io.TextIOWrapper' object is not subscriptable

Once the file is closed, the file object contents can no longer be accessed

In [97]:
with open('README.txt') as file:
    print('Printing file contents...')

for value in file:
    print('Line:', value)

Printing file contents...


ValueError: I/O operation on closed file.

We can *cast* the file object as a different object type that makes it easier to manipulate the data within the computer program

By casting the file object as a list, the data:
- is iterable
- is subscriptable
- can be accessed once the file is closed

Each element of the list is a new line of the file

In [100]:
with open('README.txt') as file:
    file = list(file)

    # Iterable
    for value in file:
        print('Line:', value)

    # Subscriptable
    print(file[0])

# Accessed after the file is closed
print(file[1])

Line: Computer programming and algorithms

Line: SEMT10002

Computer programming and algorithms

SEMT10002
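
The blank lines in the printed output appear because each line read from the file still ends with a newline character `\n`, and `print` then adds a second one. A possible way to tidy this up is to remove the newline from each line as the list is built. The sketch below assumes a file with the same two lines as `README.txt`, recreated in a temporary folder so that the example runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Recreate a two-line file like README.txt (for illustration only)
    path = os.path.join(folder, 'README.txt')
    with open(path, 'w') as file:
        file.write('Computer programming and algorithms\nSEMT10002\n')

    lines = []
    with open(path) as file:
        for line in file:
            # rstrip('\n') removes the newline at the end of the line
            lines.append(line.rstrip('\n'))

# The list can still be used after the file is closed
print(lines)  # ['Computer programming and algorithms', 'SEMT10002']
```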



# What is a file?

Every file is a sequence of bytes (one byte is eight bits) used to store data.

The file type determines what these bytes represent. 

For example, in a `.txt` file, each byte (or short group of bytes) represents a character.

This mapping of bytes to characters is called an *encoding*.



There are different character encodings in use today. 

UTF-8 is the most widely used encoding: <br>https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=bin

| Character| Binary encoding     | Decimal value     |  
| :------: | :---------------:   | :---------------: |
| A       | 01000001             | 65            |
| B       | 01000010             | 66            |
| C       | 01000011             | 67            |

When `open` is used in its default text mode, a character encoding (UTF-8 on most systems) is applied, meaning that we see characters, not bytes

In [106]:
with open('README.txt') as file:
    print(file.read())

Computer programming and algorithms
SEMT10002



The first 3 characters of 'README.txt' with their UTF-8 binary encodings:

| Character| Binary encoding     | Decimal value  |  
| :------: | :---------------:   | :---------------: |
| C       | 01000011             | 67             |
| o       | 01101111             | 111            |
| m       | 01101101             | 109            |
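
The byte values for these characters can be checked directly in Python. The sketch below uses the string method `encode` to turn text into bytes, then prints each byte as a character, an 8-bit binary number and a decimal number. The accented character at the end is an extra example, not taken from `README.txt`, showing that some characters need more than one byte in UTF-8.

```python
# Text -> bytes using the UTF-8 encoding
encoded = 'Com'.encode('utf-8')
print(encoded)  # b'Com'

rows = []
for byte in encoded:
    # Iterating over bytes gives integers in the range 0 to 255
    row = (chr(byte), format(byte, '08b'), byte)
    rows.append(row)
    print(row)

# '\u00e9' is the character e with an acute accent
accented = '\u00e9'.encode('utf-8')
print(accented, len(accented))  # b'\xc3\xa9' 2
```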


# Reading different types of file

__Text files__: Human-readable data. <br>Bytes represent plain text characters <br>e.g. .py, .csv, .json, .txt

__Binary files__: Data that is not intended to be human-readable. <br>Bytes do not represent plain text characters, but other information about the file. <br>e.g. executable programs (.exe, .bin), images (.jpg, .png, .gif), audio (.mp3, .wav), video (.mp4, .avi), compressed files (.zip). 

Text file: `.txt`

In [111]:
with open('README.txt') as file:
    print(file.read())

Computer programming and algorithms
SEMT10002



Text file: `.csv`

In [114]:
with open('temperature.csv') as file:
    print(file.read())

Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
6,5,8,10,13,16,18,18,15,13,8,7



Binary file: `.png`

In [117]:
with open('snake.png') as file:
    print(file.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

To open a binary file, we must give a second argument within the parentheses of the `open` function. 

This is called the *mode*

`rb` represents `r`ead and `b`inary

In [120]:
with open('snake.png', 'rb') as file:
    print(file.read())

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\\\x00\x00\x01\xa2\x08\x03\x00\x00\x00\x9cP\x8ew\x00\x00\x00TPLTEx\xbc>)F\t\x80\xc7D\xfb\xf9\x9fJ\x81\x0f\xcb\xcbd\x05\x06\x03\xef\xefv\xef\xee\xef\xff\xff\xff]\x9b!\xa8\xeee $\x18\xdc\xdcn\xdb\xdb\xdcj\xaa0\xde\x0bHL_#\x98\xdbY\r7[LLT\x9e\x9f\xa4wy\x80\xc0\xc1\xc4\x83\x82C\xa9\xa9V\x81\r-\xfba\x92_\xb6\xb4\xa5\x00\x00 \x00IDATx\xda\xec\x9d\x89v\xe3\xaa\x12E\x85\xa2`I.\xc1\x92\xc2`\xff\xff\x8f>\x8a\x19\r\xb6\xf3\xec\xbe\xdd\t&\xe9L\x02k\xf5\xe6\xe8P\x8cn\x862\xf5ez_}\xe6j\xf3F\xf0\x86\xfb\x86\xfb\xbe\xfa\x86\xfb\x86\xfb\x86\xfb\xbe\xfa#\xe0\xae\xf2\xbd\xe1\xbe\xf0j\xaf\xa5T\x98$\xb39\xdfpo_\xf5\xcc\x10\x98\xd4\xfd\xe6z\x0eV\t\xa0\x14\x80\x10\x02\x14\x946\x7fz\xc3\xbduu\x18\x98\xb4\xc8\x90\x18~\x17\n\to\xcbb.\xd25\xcb\xd9\'N\xa8\xd0\xc3\x1b\xee\xf1U\xa3Fa\x948zf\xcb\xd2\x8c\xb3\x17\xe5P\x94\x95@gn3-\xfe\xe3|n\x0c^\xf6\x86{pu\xe8\r3\xc2=\xd8 \xc9s\xdb\xe5|=\xda\xce\x91\xf5`\x17\xfc\xc9\xe0\x05\x90o\xb8{W\x87A\x0b:\xb7\x9e\xec\x82\xa9]\x16\

The data shown may look confusing, but it is simply a series of bytes (8-bit binary numbers)

`\x` indicates a hexadecimal (base 16) number which is another way to represent a byte   
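
When Python prints bytes, values that match a printable ASCII character are shown as that character, and other values are shown in `\x` hexadecimal form. As a small sketch, the first four bytes of the output above can be converted to hexadecimal, decimal and binary:

```python
# The first four bytes of the PNG output shown above
data = b'\x89PNG'

rows = []
for byte in data:
    # Each byte is an integer; show it in base 16, base 10 and base 2
    row = (hex(byte), byte, format(byte, '08b'))
    rows.append(row)
    print(row)
```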

As in a text file, each byte of data in a binary file encodes a meaning.

However, unlike text data, the encoding of binary files is not a simple mapping from bytes to human readable characters. 

Meaning of example bytes of `.png` file

| Value shown | Meaning                | Binary encoding              | Hexadecimal value  | Decimal value    |
| :------    |:---------------        |:---------------             | :--------------- |:--------------- |
| `\x89`      | Start of PNG file      |10001001                      | 0x89               | 137              |
| `PNG`       | PNG in ASCII encoding  |01010000, 01001110, 01000111  | 0x50, 0x4E, 0x47     | 80,  78,  71       |
| `\n`        | Unix style line ending*|00001010                      | 0x0A               | 10               |

*Windows line ending is `\r\n`

In a binary file, the meaning of a byte depends on both its:
- value
- position in the file

For example, a binary `.png` file is structured into a series of chunks, composed of bytes, that are used to reconstruct the image when the file is opened in an image viewer or editor. 

Each chunk has a specific purpose e.g.:
- identifying the file type
- holding the colour and transparency of each pixel in the image
- marking the file's end 
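
To illustrate how meaning depends on position, the sketch below interprets the first 24 bytes of the `snake.png` output shown earlier. It relies on the layout of a PNG file: an 8-byte signature, then a first chunk made up of a 4-byte length, a 4-byte type and, for the `IHDR` chunk, the image width and height as 4-byte numbers. The bytes are typed in by hand here; in a real program they could come from `file.read(24)` with the file opened in `'rb'` mode.

```python
# First 24 bytes copied from the output of reading snake.png
header = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\\\x00\x00\x01\xa2'

# Bytes 0-7: the signature that identifies the file type
is_png = header[0:8] == b'\x89PNG\r\n\x1a\n'

# Bytes 8-11: length of the first chunk's data, as a whole number
chunk_length = int.from_bytes(header[8:12], 'big')

# Bytes 12-15: the chunk type, stored as ASCII characters
chunk_type = header[12:16].decode('ascii')

# Bytes 16-19 and 20-23: image width and height in pixels
width = int.from_bytes(header[16:20], 'big')
height = int.from_bytes(header[20:24], 'big')

print(is_png, chunk_length, chunk_type, width, height)
```

The same byte value (for example `\x00`) appears in several places, but what it means depends on where it sits in the file.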

Data that is intended to be read by a computer program (e.g. text/numerical data) is therefore usually stored as text files rather than binary files if possible.

__Text files__: e.g. .py, .csv, .json, .txt

__Binary files__: e.g. executable programs (.exe, .bin), images (.jpg, .png, .gif), audio (.mp3, .wav), video (.mp4, .avi), compressed files (.zip)

We will be working with data stored in text file formats for the rest of the unit. 
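
As a small preview of converting text data to a more useable form, the sketch below reads a file with the same two lines as `temperature.csv` and turns them into a list of strings and a list of integers. The variable names `months` and `temperatures` are assumptions based on the file name, and the file is recreated in a temporary folder so that the example runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Recreate the contents of temperature.csv (for illustration only)
    path = os.path.join(folder, 'temperature.csv')
    with open(path, 'w') as file:
        file.write('Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec\n')
        file.write('6,5,8,10,13,16,18,18,15,13,8,7\n')

    with open(path) as file:
        lines = list(file)

# Remove the newline, then split each line at the commas
months = lines[0].strip().split(',')

temperatures = []
for value in lines[1].strip().split(','):
    # Each value is read as a string, so convert it to an integer
    temperatures.append(int(value))

print(months[0], temperatures[0])  # Jan 6
print(max(temperatures))           # 18
```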

# Summary

We have learnt:
- How to open and close a file using `with open()`
- How to read the contents of a file using `read()`
- The difference between text files and binary files

### Need to see some more examples? 
https://realpython.com/python-csv/

### Want to take a quiz?
https://realpython.com/quizzes/read-write-files-python/
<br>https://pynative.com/python-file-handling-quiz/

### Want some more advanced information?
https://pynative.com/python/file-handling/#:~:text=To%20read%20or%20write%20a,It%20returns%20the%20file%20object.