# NB22: Files

## Programming Fundamentals

## L.EIC/2022-23

#### João Correia Lopes$^{1}$, Nuno Macedo$^{1}$, Pedro Vasconcelos$^{2}$
$^{1}$FEUP/DEI & INESC TEC\
$^{2}$FCUP/DCC & LIACC

> I have files, I have computer files and, you know, files on paper. But most of it is really in my head. So God help me if anything ever happens to my head!

George R. R. Martin



## Goals

By the end of this class, the student should be able to:

- Describe how a program reads data from external storage in order to manipulate it

- Describe how to make data outlive the program that creates it

## Bibliography

- Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers, *How to Think Like a Computer Scientist — Learning with Python 3* (Chapter 13) [[HTML](http://openbookproject.net/thinkcs/python/english3e/files.html)]

- Brad Miller and David Ranum, *How to Think Like a Computer Scientist: Interactive Edition*. Based on material by Jeffrey Elkner, Allen B. Downey, and Chris Meyers (Chapter 11) [[HTML](https://runestone.academy/ns/books/published/thinkcspy/Files/toctree.html)]

# 22 Files

## 22.1 Persistence & I/O

> **And now for something completely different!** (*Flying Circus*)
>
> - Rather than avoiding side effects (effect-free programming style)
>
> - ... we focus on achieving persistence (doing I/O)

> "most **real computer programs** must retrieve stored information and
> record information for future use."

### Persistence

> "In computer science, *persistence* refers to the characteristic of
> **state that outlives the process that created it**.
>
> This is achieved in practice by storing the state as data in computer
> data storage.
>
> Programs have to transfer data to and from storage devices and have to
> **provide mappings** from the native programming-language data
> structures to the storage device data structures."
> [[Wikipedia]](https://en.wikipedia.org/wiki/Persistence_(computer_science))
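
The "mapping" mentioned in the quote can be sketched with a small round trip: a Python list is written to a file as lines of text and then rebuilt from those lines. The data, the file name `scores.txt`, and the one-number-per-line layout are assumptions made for this illustration only.

```python
import os
import tempfile

# Hypothetical data and file name, chosen only for illustration.
scores = [12, 7, 30]
path = os.path.join(tempfile.mkdtemp(), "scores.txt")

# Map the list to the file's structure: one number per line, as text.
with open(path, "w") as out_file:
    for score in scores:
        out_file.write(f"{score}\n")

# Map the lines of the file back to a list of integers.
with open(path, "r") as in_file:
    restored = [int(line) for line in in_file]

print(restored == scores)
```

The file keeps existing after the program ends, so a later run could rebuild the list from it.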

### About files

- While a program is running, its data is stored in random access memory (RAM)

- RAM is faster than network or disks, but it is also **volatile**

- To make data available the next time the program is started, it has to be written to a **non-volatile** storage medium

- Data on non-volatile storage media is stored in named locations called **files**

### Finding a File on your Disk

- Opening a file requires that you, as a programmer, and Python agree about the location of the file on your disk

- The way that files are located on disk is by their **path**

- You can think of the **filename** as the short name for a file, and the path as the full name.

![Tree](https://raw.githubusercontent.com/fp-leic/public/main/notebooks/22/tree.png)
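
As a small sketch of the difference between a filename and a path, the standard `os.path` module can split one from the other. The path below is hypothetical and the file does not need to exist.

```python
import os

# Hypothetical path, used only for illustration; the file need not exist.
path = os.path.join("home", "student", "notes", "todo.txt")

filename = os.path.basename(path)   # the short name of the file
directory = os.path.dirname(path)   # the directories leading to the file

print(filename)
print(directory)

# The full (absolute) path the short name would have in the current directory.
print(os.path.abspath(filename))
```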

## 22.2 Writing our first file

- Opening a file creates what is called a file **handle**

- Our program calls methods on the handle, and this makes changes to the actual file which is usually located on our disk

- Let's begin with a simple program that writes three lines of text into a file:

```
  with open("test.txt", "w") as myfile:
      myfile.write("My first file written from Python\n")
      myfile.write("---------------------------------\n")
      myfile.write("Hello, world!\n")
```

- You could also write: `f = open("workfile", "w")`

- But if you are not using `with`, you should call `f.close()` when you are done, to close the file and immediately free up any system resources used by it

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/myfile.py>
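
When `with` is not used, one possible way to make sure `close()` is always called is a `try`/`finally` block. This is a minimal sketch; the file is created in a temporary directory so that the example does not touch the current directory.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "workfile")

f = open(path, "w")
try:
    f.write("Hello, world!\n")
finally:
    # Runs whether or not the write succeeded.
    f.close()

print(f.closed)
```

The `with` statement does the same bookkeeping in fewer lines.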


### Modes

- To manipulate files one needs to provide the path to the file and the **mode** for `open()`

| **Character** | **Meaning** |
|:-------------:|:------------|
|  'r'          |  open for reading (default) |
|  'w'          |  open for writing, truncating the file first |
|  'x'          |  open for exclusive creation, failing if the file already exists |
|  'a'          |  open for writing, appending to the end of the file if it exists |
|  'b'          |  binary mode |
|  't'          |  text mode (default) |
|  '+'          |  open a disk file for updating (reading and writing) |

$\Rightarrow$
<https://docs.python.org/3/library/functions.html#open>

With mode "w", if there is no file named `first.txt` on the disk, it will be created. \
If there already is one, it will be replaced by the file we are writing.

In [None]:
with open("first.txt", "w") as myfile:
    myfile.write("My first file written from Python\n")
    myfile.write("---------------------------------\n")
    myfile.write("Hello, world!\n")
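
The difference between the modes `"w"`, `"a"` and `"x"` can be checked with a short experiment. The file name `modes.txt` is an assumption for this sketch, and the file lives in a temporary directory.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as f:      # creates the file
    f.write("first\n")

with open(path, "w") as f:      # truncates: "first" is lost
    f.write("second\n")

with open(path, "a") as f:      # appends after the existing content
    f.write("third\n")

with open(path) as f:           # default mode is "r"
    contents = f.read()
print(contents)

try:
    with open(path, "x") as f:  # exclusive creation fails: the file exists
        f.write("never written\n")
    created = True
except FileExistsError:
    created = False
print(created)
```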

## 22.3 Reading a file line-at-a-time

- Now that the file exists on our disk, we can open it, this time for reading, and read all the lines in the file, one at a time

- The `for` loop in line 2 reads one line per iteration, up to **and including the newline character**

```
   with open("test.txt", "r") as my_handle:
       for the_line in my_handle:
           # Do something with the line we just read. Here we just print it.
           print(the_line, end="")
```

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/myfile.py>
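
To see that each line read from a file still carries its newline character, we can print its `repr`. This is a sketch with a hypothetical two-line file created in a temporary directory.

```python
import os
import tempfile

# Hypothetical file with two lines, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("alpha\nbeta\n")

lines_seen = []
with open(path, "r") as my_handle:
    for the_line in my_handle:
        lines_seen.append(the_line)
        # repr shows the trailing "\n"; rstrip("\n") removes it.
        print(repr(the_line), "->", repr(the_line.rstrip("\n")))
```

This is why the examples in this section pass `end=""` to `print`: the line already ends with a newline.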

Let's read our `first.txt` file.

In [None]:
with open("first.txt", "r") as myfile:
    for the_line in myfile:
        # Do something with the line we just read. Here we just print it.
        print(the_line, end="")

It is also possible to open the file and work with it without `with`, but then you must remember to close it at the end!


In [None]:
f = open("first.txt", "r")
contents = f.readline()
print(contents)
f.close()

However:

* the `with open(...)` block above ensures that the file handle is *always* closed at the end (even if the code inside the block fails for some reason)
* we should use it instead of manually using `open` and `close`
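
The claim that the handle is closed even when the code inside the block fails can be checked with the `closed` attribute. In this sketch the failure is provoked on purpose by converting text that is not a number.

```python
import os
import tempfile

# Hypothetical file whose content cannot be converted to an integer.
path = os.path.join(tempfile.mkdtemp(), "bad_number.txt")
with open(path, "w") as f:
    f.write("not a number\n")

try:
    with open(path, "r") as handle:
        value = int(handle.readline())  # raises ValueError
except ValueError:
    print("Conversion failed")

# The with block closed the file before the exception reached us.
print(handle.closed)
```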

## 22.4 Turning a file into a list of lines

- It is often useful to fetch data from a disk file and turn it into a list of lines

- The `readlines` method in line 2 reads all the lines and returns a list of the strings

    - We could read each line one-at-a-time and build up the list ourselves, but it is a lot easier to use the method that the Python implementors gave us!

```
  with open("players.txt", "r") as input_file:
      all_lines = input_file.readlines()
```

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/players.py>
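
As a sketch of the alternative mentioned above, the list can be built by hand and compared with the result of `readlines`. The names in the file are made up for the example.

```python
import os
import tempfile

# Hypothetical file with made-up names, one per line.
path = os.path.join(tempfile.mkdtemp(), "players.txt")
with open(path, "w") as f:
    f.write("Ana\nRui\nEva\n")

# Using the method provided by Python.
with open(path, "r") as input_file:
    by_method = input_file.readlines()

# Building the same list ourselves, one line at a time.
by_hand = []
with open(path, "r") as input_file:
    for line in input_file:
        by_hand.append(line)

print(by_method == by_hand)

# Each string keeps its newline; strip() removes it when it is not wanted.
names = [line.strip() for line in by_method]
print(names)
```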


Get a list of lines from a file.

In [None]:
!wget https://raw.githubusercontent.com/fp-leic/public/main/lectures/22/files/players.txt

In [None]:
# with open("files/players.txt", "r") as input_file:
with open("/content/players.txt", "r") as input_file:
    all_lines = input_file.readlines()

print(all_lines)

Sort lines and write back to the file system.

In [None]:
all_lines.sort()

with open("sorted_players.txt", "w") as output_file:
    for line in all_lines:
        output_file.write(line)

## 22.5 Reading the whole file at once

- Another way of working with text files is to read the complete contents of the file into a string, and then to use our string-processing skills to work with the contents

- By default, if we don't supply the mode, Python opens the file for reading

```
  with open("somefile.txt") as f:
      content = f.read()
   
  words = content.split()
  print(f"There are {len(words)} words in the file.")
```

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/players2.py>

Read the whole file into a string variable.

In [None]:
with open("players.txt") as f:
    content = f.read()

words = content.split()

print(f"There are {len(words)} words in the file.")
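
Once the content is in a string, the usual string methods apply. The sketch below uses a hypothetical two-line file so that the results are easy to verify by hand.

```python
import os
import tempfile

# Hypothetical file with known content.
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")
with open(path, "w") as f:
    f.write("the cat sat\non the mat\n")

with open(path) as f:
    content = f.read()

words = content.split()        # split on any whitespace, including newlines
lines = content.splitlines()   # split on line boundaries

print(f"There are {len(words)} words in {len(lines)} lines.")
print(f"The word 'the' occurs {words.count('the')} times.")
```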

### Methods of File Objects

| **Method**               | **Description** |
|:-------------------------|:--------------- |
| `f.read()`               | reads the entire file |
| `f.readline()`           | reads a single line from the file |
| `f.write(string)`        | writes the contents of string to the file |
| `f.tell()`               | returns an integer giving the file object's current position |
| `f.seek(offset, whence)` | changes the file object's position |

$\Rightarrow$
<https://docs.python.org/3.6/tutorial/inputoutput.html#methods-of-file-objects>
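
A short sketch of `tell` and `seek` follows. The file is opened in binary mode because text files only allow seeking relative to the beginning of the file (apart from jumping straight to the end); the ten-character content is an assumption for the example.

```python
import os
import tempfile

# Hypothetical file holding the ten characters 0..9.
path = os.path.join(tempfile.mkdtemp(), "digits.txt")
with open(path, "w") as f:
    f.write("0123456789")

with open(path, "rb") as f:
    start = f.tell()          # position before reading anything
    first = f.read(4)         # read four bytes
    after_read = f.tell()     # the position advanced by four

    f.seek(0)                 # back to the beginning (whence defaults to 0)
    again = f.read(2)

    f.seek(-3, os.SEEK_END)   # three bytes before the end of the file
    tail = f.read()

print(start, first, after_read, again, tail)
```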

## 22.6 Working with binary files

* Files that hold photographs, videos, zip files, executable programs, etc. are called **binary files**
* When we read from a binary file, we get **bytes** back rather than a string


```
# Copy a binary file in chunks of 1024 bytes.
with open("somefile.zip", "rb") as f, open("thecopy.zip", "wb") as g:
    while True:
        buf = f.read(1024)  # attempt to read 1024 bytes
        if len(buf) == 0:   # an empty result means end of file (EOF)
            break
        g.write(buf)        # write those bytes
```

What will `type(buf)` return?
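
One way to check the answer is to run a small experiment. The sketch below writes four arbitrary bytes to a hypothetical file and reads them back in binary mode.

```python
import os
import tempfile

# Hypothetical binary file with four arbitrary byte values.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
data = bytes([0, 255, 16, 32])

with open(path, "wb") as g:
    g.write(data)

with open(path, "rb") as f:
    buf = f.read(1024)      # fewer bytes are returned if the file is shorter
    at_eof = f.read(1024)   # nothing left: an empty bytes object

print(type(buf))
print(len(buf), len(at_eof))
```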

## 22.7 An example

### A filter example

- Here is a filter that copies one file to another, omitting any lines that begin with \#:

```
  def filter(oldfile, newfile):
      with open(oldfile, "r") as infile, open(newfile, "w") as outfile:

          for line in infile:

              # Put any processing logic here
              if not line.startswith('#'):
                  outfile.write(line)
```

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/filter.py>

Define filter function.

In [None]:
def filter(oldfile, newfile):
    with open(oldfile, "r") as infile, open(newfile, "w") as outfile:
        for line in infile:
            # Put any processing logic here
            if not line.startswith('#'):
                outfile.write(line)

Process the file.

In [None]:
filter("files/filter.py", "files/filter.txt")
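
To check the behaviour of the filter on known input, here is a self-contained sketch. The function is renamed to the hypothetical `filter_comments` so that it does not hide Python's built-in `filter`, and the input file is created in a temporary directory.

```python
import os
import tempfile


def filter_comments(oldfile, newfile):
    """Copy oldfile to newfile, omitting lines that begin with '#'."""
    with open(oldfile, "r") as infile, open(newfile, "w") as outfile:
        for line in infile:
            if not line.startswith("#"):
                outfile.write(line)


# Hypothetical input file, created only for this example.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "script.py")
target = os.path.join(workdir, "script.txt")

with open(source, "w") as f:
    f.write("# a comment\nx = 1\n  # indented comment\n# another\nprint(x)\n")

filter_comments(source, target)

with open(target) as f:
    result = f.read()
print(result)
```

Note that the indented comment is kept: `startswith('#')` only looks at the first character of the line.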

## 22.8 Directories

- Files on non-volatile storage media are organized by a set of rules known as a **file system**

- File systems are made up of files and directories, which are containers for both files and other directories

- When we open a file by its name alone, Python looks for it in the current working directory

- If we want to open a file somewhere else, we have to specify the path to the file, which includes the name of the directory (or folder) where the file is located

```
  >>> wordsfile = open("/usr/share/dict/words", "r")
  >>> wordlist = wordsfile.readlines()
  >>> print(wordlist[:7])
  ['A\n', "A's\n", 'AMD\n', "AMD's\n", 'AOL\n', "AOL's\n", 'Aachen\n']
```
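
Since `/usr/share/dict/words` does not exist on every system, the sketch below illustrates the same idea with a temporary directory: `os.getcwd()` shows the current directory, and `os.path.join` builds the path to a file located somewhere else. The file name `words.txt` and its content are assumptions.

```python
import os
import tempfile

# The directory where Python looks for files given by name only.
print(os.getcwd())

# A hypothetical directory elsewhere on disk, with one file in it.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "words.txt")

with open(path, "w") as f:
    f.write("apple\nbanana\n")

# Open the file through its path rather than its short name.
with open(path, "r") as wordsfile:
    wordlist = wordsfile.readlines()

print(wordlist)
print(os.listdir(directory))   # names of the entries in that directory
```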

## 22.9 What about fetching something from the Web?

### Fetching from the Web

- Here is a very simple example that copies the contents at some Web URL to a local file

- The `urlretrieve` function could be used to download any kind of content from the Web

- The resource we're trying to fetch must exist (check it using a browser)

```
  import urllib.request

  url = "https://www.ietf.org/rfc/rfc793.txt"
  destination_filename = "rfc793.txt"

  urllib.request.urlretrieve(url, destination_filename)
```

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/scraping.py>
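
What happens when the resource does not exist can be sketched without a network connection, because `urlretrieve` also accepts `file://` URLs. The files below are hypothetical and live in a temporary directory; the sketch assumes that failures on such URLs are reported as `urllib.error.URLError`, as they are in CPython.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request

# Hypothetical local "resource", addressed through a file:// URL.
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "source.txt"
source.write_text("hello from a URL\n")

destination = workdir / "copy.txt"
urllib.request.urlretrieve(source.as_uri(), str(destination))
copied = destination.read_text()
print(copied, end="")

# A resource that does not exist raises an exception we can handle.
missing_url = (workdir / "missing.txt").as_uri()
try:
    urllib.request.urlretrieve(missing_url, str(workdir / "unused.txt"))
    fetched = True
except urllib.error.URLError as error:
    fetched = False
    print("Could not fetch:", error.reason)
```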

Using module urllib:

In [None]:
import urllib.request

url = "https://www.ietf.org/rfc/rfc793.txt"
destination_filename = "rfc793.txt"

urllib.request.urlretrieve(url, destination_filename)

print("\nWritten in", destination_filename)

### Fetching from the Web using `requests`

- The module `requests` is not part of the standard library

- It is easier to use and more powerful than the `urllib` module (see [[docs]](http://docs.python-requests.org))

- Read the web resource directly into a string and print that string

```
  import requests
    
  url = "https://www.ietf.org/rfc/rfc793.txt"
  response = requests.get(url)
  print(response.text)
```

$\Rightarrow$
<https://github.com/fp-leic/public/blob/master/lectures/22/scraping.py>

Now with requests:

In [None]:
import requests

response = requests.get(url)

print(response.text)

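We can also print the text one line at a time (iterating directly over `response` would yield fixed-size chunks of bytes, not lines):

In [None]:
for line in response.text.splitlines():
    print(line)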

# Further reading

### Web & Databases

- Web Scraping in Python (using `BeautifulSoup`): [[Beginner's guide]](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)

- Data Persistence: [[The Python Standard Library]](https://docs.python.org/3.6/library/persistence.html)

- DB-API 2.0 interface for SQLite databases: [[The Python Standard Library]](https://docs.python.org/3.6/library/sqlite3.html)


### Text Files in Python

Python Tutorial || Learn Python Programming -- Socratica

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('4mX0uPQFLDU')

-- João Correia Lopes, Nuno Macedo & Pedro Vasconcelos