<a href="https://colab.research.google.com/github/fsk-lab/scics/blob/main/07_Input_Output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Input and Output

All the previous tutorials on coding have left out one enormously important aspect. So far, we have defined all variables and assigned all values by directly writing down the values in our Python code. All code outputs were inspected by printing the result to the console output. For learning how to code, this is certainly a good start!

However, in practice, this easily becomes impractical. We don't always want to type values into our Python code! It's tedious, prone to errors, and with large input sizes, simply not possible. At the same time, we want to save our results somewhere, and not only read them from the console output, which disappears after we close the code execution window.

The solution for this is easy – we need to be able to read and write files. And that's exactly what this tutorial covers.

## Reading and Writing Files

Python natively provides support for interacting with files – mainly through the built-in `open()` function and the file objects it returns. Therefore, we will first learn the basics of how to handle them. In the later part of this section, we will see that the standard library also contains a number of readers and writers for specific file types.

### The `open()` function

The `open()` function can be used flexibly to access both existing and not yet existing files. It is usually called with two positional arguments and one keyword argument, and returns a file object.

```
open(filename: str, mode: str = "r", encoding: str = None) -> file object
```
* `filename` is the absolute or relative path of the file that should be opened.
* `mode` refers to the mode in which the file should be opened. The following are the most common modes:
  - `"r"`: read-only (the default)
  - `"w"`: write-only (any existing file content is overwritten)
  - `"r+"`: read and write
  - `"a"`: append
  - `"rb"`: read-only in binary mode
  - ...
* `encoding` specifies how the bytes of a text file are translated into characters. If it is not passed, the default can depend on the platform and the Python version, so it is good practice to pass `encoding="utf-8"` explicitly unless you know that your file uses a different encoding.

The `open()` function returns a file-type object.
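
As a rough illustration of the three most common modes, the following sketch writes, appends to, and then reads a small file. The file name `modes_demo.txt` and the temporary folder are made up for this example, and the `write()`, `read()` and `close()` methods it uses are explained further below.

```python
import os
import tempfile

# Work in a temporary folder so that no existing files are touched.
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # hypothetical file

# "w" creates the file (or empties an existing one) before writing.
demo = open(demo_path, "w", encoding="utf-8")
demo.write("first line\n")
demo.close()

# "a" keeps the existing content and adds new text at the end.
demo = open(demo_path, "a", encoding="utf-8")
demo.write("second line\n")
demo.close()

# "r" opens the file for reading only.
demo = open(demo_path, "r", encoding="utf-8")
content = demo.read()
demo.close()

print(content)
```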

```
🎮  In the file manager of Google Colab, create a new file called `test_file.txt`.

Paste the following content to it:
-------------------------------------------------------------------------
This is a test text file. It doesn't contain any meaningful information.

But in the next lines, it contains some data, separated by spaces.
124 754 137
44 81 91
1 7 2
7522 9012 5621
-------------------------------------------------------------------------

Now, open this file using the `open()` function and print out the type of the new object.
```

In [None]:
# Try it!

f = open("test_file.txt", "r")

print(type(f))

This file object has a number of useful methods, for example:
* `file.read(size: int = -1)` returns a string with the next `size` characters of the file. If `size` is not passed, it returns the whole (remaining) content of the file as one string.
* `for x in file: ...` allows us to loop over all lines of the file.
* `file.write(text: str)` writes the `text` to the file. If the file was opened in `"w"` mode, its previous content is overwritten. If the file was opened in `"a"` mode, the text is appended to the end of the current content.
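
The following sketch shows how these methods behave on a small, made-up file (`methods_demo.txt` in a temporary folder, both assumptions of this example). It uses the variable name `demo` so that it does not interfere with the `f` object from the exercise.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "methods_demo.txt")  # hypothetical file

demo = open(demo_path, "w", encoding="utf-8")
demo.write("alpha\nbeta\ngamma\n")
demo.close()

demo = open(demo_path, "r", encoding="utf-8")
first_three = demo.read(3)  # the next three characters
rest = demo.read()          # everything from the current position onwards
demo.close()

demo = open(demo_path, "r", encoding="utf-8")
lines = [line for line in demo]  # each line keeps its trailing line break
demo.close()

print(repr(first_three))
print(repr(rest))
print(lines)
```

Note that reading moves the current position forward, which is why the second `read()` call only returns what has not been read yet.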

In [None]:
for line in f:
    print(line)

> ❗ Once we are done working with a file, we should close it again to release the resources it occupies and to make sure that all written data has actually been saved. This can be done with the `file.close()` method.

However, rather than manually opening and closing a file, it is best practice to use a `with` statement, which follows the syntax of:
```
with open("test_file.txt", "r") as f:
    ...  # code that uses f
```

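Once the code under the `with` statement has completed, the file is closed automatically, even if an error occurred inside the block. The variable `f` still exists afterwards, but it refers to a closed file that can no longer be read or written. This is the cleanest way of writing code that works with files.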
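
Whether a file is open can be checked with the `closed` attribute of the file object (an attribute not discussed above). A minimal sketch, using a made-up file in a temporary folder:

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")  # hypothetical file

with open(demo_path, "w", encoding="utf-8") as demo:
    demo.write("some text\n")
    closed_inside = demo.closed  # the file is still open inside the block

# The name `demo` can still be looked up here, but the file is now closed.
closed_after = demo.closed

print(closed_inside, closed_after)
```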



---


🧠 The object that a `with` statement works on is a so-called **context manager**. A context manager defines set-up code that runs when the `with` block is entered, and clean-up code that is guaranteed to run when the block is left (e.g. closing a file). There are many scenarios in Python code where this is useful – and the standard library contains the `contextlib` module to create custom context managers for specific purposes.
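
As a small sketch of what a custom context manager could look like, the following example uses the `contextlib.contextmanager` decorator to time a block of code. The `timer` function and its dictionary of results are hypothetical names chosen for this illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Hypothetical context manager that measures how long a block takes."""
    start = time.perf_counter()
    results = {"label": label}
    try:
        yield results  # the value bound by "as" in the with statement
    finally:
        # Clean-up code: runs when the block is left, even after an error.
        results["seconds"] = time.perf_counter() - start

with timer("sum") as t:
    total = sum(range(1000))

print(total, t["label"], t["seconds"] >= 0)
```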

---

### Reading Files

The `open()` function and the resulting file-type objects give us a basic tool that we can use to process all kinds of files! The key requirement for this, however, is that we know the structure of the file.

In the case of `test_file.txt` from above, all lines starting from line 4 contain three numbers. A natural way of turning this into a Python data structure would be, e.g., to generate three lists, one for each column.

For this, we need to know two important functions to process strings:
* `str.split(sep: str, ...) -> list[str]` splits a string at each occurrence of `sep`, and returns a list of sub-strings.
* `str.strip(chars: str = None) -> str` removes all characters contained in the string `chars` from the beginning and the end of the string. By default, the `strip` method removes all whitespace (i.e. spaces, tabs and line breaks) from the beginning and the end of the string.

In [None]:
a = " 124 , 754 , 137 \n"

b = a.split(",")
print(b)

In [None]:
c = [item.strip() for item in b]
print(c)

With this knowledge, we can try to parse the number-containing lines in `test_file.txt` into three lists, one per column.

In [None]:
col_0, col_1, col_2 = [], [], []

with open("test_file.txt", "r") as f:
    all_lines = list(f)  # This is necessary since files are not indexable
    for line in all_lines[3:]:
        numbers = [num.strip() for num in line.split(" ")]
        col_0.append(int(numbers[0]))
        col_1.append(int(numbers[1]))
        col_2.append(int(numbers[2]))

print(col_0)
print(col_1)
print(col_2)

We could now use these lists to do some calculations with them – without needing to type the values into the code. This allows us to easily run the same kind of code on lots of different input files!

What we need for this is the knowledge about the exact file structure – e.g. how many lines there are before the actual data starts, or which separator is used in each line. For uncommon file formats, we might have to do this ourselves.
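
One possible way to make this knowledge explicit is to pass it to a small reader function as parameters. The sketch below assumes a hypothetical helper `read_columns` with the made-up parameters `skip_lines` and `sep`, and tests it on a demo file with the same layout as the example above.

```python
import os
import tempfile

def read_columns(filename, skip_lines=0, sep=None):
    """Read integer columns from a text file (hypothetical helper).

    The first `skip_lines` lines are ignored. With `sep=None`,
    `str.split()` splits on any run of whitespace.
    """
    columns = []
    with open(filename, "r", encoding="utf-8") as source:
        for i, line in enumerate(source):
            if i < skip_lines or not line.strip():
                continue  # skip header lines and empty lines
            values = [int(item) for item in line.split(sep)]
            if not columns:
                columns = [[] for _ in values]  # one list per column
            for column, value in zip(columns, values):
                column.append(value)
    return columns

# Small demo file: three header lines, then two rows of data.
demo_path = os.path.join(tempfile.mkdtemp(), "columns_demo.txt")
with open(demo_path, "w", encoding="utf-8") as out:
    out.write("Header line\n\nAnother header line\n124 754 137\n44 81 91\n")

cols = read_columns(demo_path, skip_lines=3)
print(cols)
```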

However, there are a number of very common file formats where such "file readers" have already been developed, and have become part of the standard library in Python. We will later learn about the *.csv* and *.json* file types.

### Writing Files

Similarly, we can use `with open(...)` to write new lines to a file. For this, the file needs to be opened in *write* (`w` / `w+`) or *append* mode (`a`). Then we can use the `file.write(s: str) -> int` method, which writes the passed string to the file and returns the number of characters written.

In [None]:
with open("test_file.txt", "a") as f:
    f.write("888 777 666")

If we check the new content of `test_file.txt`, this highlights an important pitfall in writing files – we need to manage line breaks by ourselves!

> 💡  If we open a non-existing file in *write* mode, the file is created automatically as soon as it is opened!

In [None]:
with open("test_number_2.txt", "w") as f:
    f.write("This is line number 1\n")
    f.write("This is the line where I forgot the line break")
    f.write("Another line")
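
One way to avoid forgotten line breaks is to keep the lines in a list and add the `"\n"` in a single place. A minimal sketch, with a made-up file name in a temporary folder:

```python
import os
import tempfile

lines = ["This is line number 1", "This is line number 2", "Another line"]

demo_path = os.path.join(tempfile.mkdtemp(), "line_breaks_demo.txt")  # hypothetical file

with open(demo_path, "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line + "\n")  # the line break is added explicitly

# Read the file back in to check the result.
with open(demo_path, "r", encoding="utf-8") as check:
    written = check.read()

print(written)
```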

Similar to reading files, there are a number of common file formats for which the standard library already contains writing functionality!

### Specific File Types

#### The `csv` file format

*Comma-separated value* files, or *.csv* files for short, are a common file type to store tables in a file. Typically, one line of the file contains one row of the table, and the columns are separated by commas (even though *.csv* files with other column delimiters exist). *.csv* files are readily readable by a human, and can e.g. be processed by common spreadsheet programs.

An example of a *.csv* file, containing an experimental UV/Vis spectrum, can be found in the `data` directory of the course's [GitHub repository](https://github.com/fsk-lab/scics).

For parsing *.csv* files, the standard library contains a module called `csv`. We can create a reader object through `csv.reader(file)`, which automatically parses each row of the file. We can use it to loop over all rows in a *.csv* file and get, for each row, a list of all values as strings.

In [None]:
import csv

# newline="" is recommended by the csv documentation for files passed to csv.reader
with open("uv_spectrum.csv", "r", newline="") as f:
    reader = csv.reader(f)

    for row in reader:
        print(row)

Similarly, the `csv` module contains a `csv.writer(file)` that can be used to write data into a *.csv* file. Its `.writerow(list)` method writes the items of the list as one new row of the file.
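
A minimal sketch of the writer in action is shown below. The table content, the column names and the file name `writer_demo.csv` are made up for this illustration and are not taken from the course data.

```python
import csv
import os
import tempfile

# Made-up example table: a header row followed by two data rows.
rows = [
    ["wavelength", "absorbance"],
    [400, 0.12],
    [410, 0.35],
]

demo_path = os.path.join(tempfile.mkdtemp(), "writer_demo.csv")  # hypothetical file

# newline="" is recommended by the csv documentation when opening csv files.
with open(demo_path, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)

# Read the file back in: all values are returned as strings.
with open(demo_path, "r", newline="", encoding="utf-8") as check:
    rows_reloaded = list(csv.reader(check))

print(rows_reloaded)
```

Since the reader returns every value as a string, numbers need to be converted (e.g. with `float()`) before calculating with them.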

```
🎮  Use the knowledge about reading and writing csv files to write a small program
that does the following things:
1) Read in the `uv_spectrum.csv` file
2) Write a new file `uv_spectrum_normalized.csv` in which the second column
   (i.e. the absorption intensities) are normalized to a scale from 0 to 1.
```

In [None]:
# Try the exercise!

> ❗ Unfortunately, there are different "dialects" of *.csv* files (e.g. different delimiters, different usage of spaces, ...), which can make uniform handling of *.csv* files difficult. Therefore, the `csv` module allows us to specify different dialect settings, or even to identify the "dialect" of a specific file automatically (using a `csv.Sniffer` object). Many more details on how to use the `csv` module are provided in the [module documentation](https://docs.python.org/3/library/csv.html).
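
Both options can be sketched as follows, assuming a made-up sample that uses semicolons as delimiters. To keep the example self-contained, an `io.StringIO` object stands in for an opened file.

```python
import csv
import io

# Made-up sample in a "semicolon dialect".
sample = "wavelength;absorbance\n400;0.12\n410;0.35\n"

# Option 1: state the delimiter explicitly.
rows_explicit = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Option 2: let csv.Sniffer guess the dialect from a sample of the text.
# Restricting the candidate delimiters makes the guess more reliable.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
rows_sniffed = list(csv.reader(io.StringIO(sample), dialect))

print(dialect.delimiter)
print(rows_sniffed)
```

Automatic detection is a heuristic, so for files with a known layout, stating the delimiter explicitly is usually the safer choice.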

#### The `json` file format

In previous tutorials, we have learned about lists and dictionaries as two of the most important compound data types in Python. In many cases, we want to store these objects (e.g. lists or dictionaries) into a file, and be able to read them in again in another piece of code.

The JavaScript Object Notation (short: **JSON**) gives us a standardized way to do so – without needing to define a custom format in which we can read or store our data. In other words: We can save our list or dictionary as a *.json* file – and when we read it in, we directly get our list or dictionary back.

In Python, this can be practically done with the `json` library. The `json.load(file)` function can be used to load data from a *.json* file. The `json.dump(obj, file)` function can be used to save a list or dict object to a file.

In [None]:
import json

elements = {
    "H": [1, 1.01],
    "He": [2, 4.00],
    "Li": [3, 6.94]
}

with open("elements.json", "w") as f:
    json.dump(elements, f)

In [None]:
with open("elements.json", "r") as f:
    elements_reloaded = json.load(f)

print(elements_reloaded)

An additional advantage of the .json file format is the fact that it is readily readable by humans. For example, the "elements.json" file looks as follows:

```
{"H": [1, 1.01], "He": [2, 4.0], "Li": [3, 6.94]}
```

In fact, we can readily write .json files by hand if we account for the following differences from Python syntax:
* The booleans `True` and `False` are stored as `true` and `false`.
* The `None` value is stored as `null`.
* Strings (including dictionary keys) must be enclosed in double quotes.

There are a number of additional considerations that need to be taken into account when using the `json` module to store Python objects into *.json* files:
* The `set` data type is not supported, and a `tuple` is stored as a list, so it comes back as a `list` when the file is read in again. *(A set can be converted into a list before saving it in a json file.)*
* Values within the `list` or `dict` must be one of the following data types: `str`, `int`, `float`, `bool`, `None`. Nested lists or dictionaries (e.g. a list of lists) are also allowed.
* Dictionary keys are always stored as strings, so e.g. the key `1` comes back as `"1"`.
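
These points can be tried out without creating a file by using `json.dumps()` and `json.loads()`, which work like `dump` and `load` but with a string instead of a file. The dictionary content below is made up for illustration.

```python
import json

record = {"isotopes": (3, 4), "stable": True, "note": None}

text = json.dumps(record)
record_reloaded = json.loads(text)

print(text)
print(record_reloaded)  # the tuple has come back as a list

# A set cannot be written directly ...
try:
    json.dumps({"mass_numbers": {3, 4}})
    set_failed = False
except TypeError as error:
    set_failed = True
    print("TypeError:", error)

# ... but a (sorted) list with the same items can.
list_text = json.dumps({"mass_numbers": sorted({3, 4})})
print(list_text)
```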

For many other data types, the `pickle` library can be used to save almost any Python object into a file, and re-load it later. `pickle` stores objects in a binary format, so the file has to be opened in binary mode. While this is often practical, these files cannot be readily read or edited by humans. They also raise security concerns: loading a pickle file can execute arbitrary code, so only files from a trusted source should ever be loaded.
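
A minimal sketch of a round trip with `pickle.dump()` and `pickle.load()`, assuming a made-up set of tuples and a hypothetical file name `bonds.pkl`:

```python
import os
import pickle
import tempfile

# A set of tuples: not something the json module can store directly.
bonds = {("H", "O"), ("C", "H")}

demo_path = os.path.join(tempfile.mkdtemp(), "bonds.pkl")  # hypothetical file

with open(demo_path, "wb") as out:  # binary write mode
    pickle.dump(bonds, out)

with open(demo_path, "rb") as source:  # only load pickle files you trust
    bonds_reloaded = pickle.load(source)

print(bonds_reloaded == bonds)
```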

## Interacting with the Operating System

We can use Python code to interact with the operating system and the file system maintained by the OS. For example, Python code can be used to search, create, move, copy, or delete files in the operating system. For this purpose, the standard library contains a number of useful modules.

### Navigating the file system with `pathlib`

In the early parts of this class, we have seen that any operating system maintains a tree of directories and sub-directories to structure all files on the computer. Every file is identified by its **path**, i.e. a full sequence of directories and sub-directories starting from the *root*.

In a UNIX system, a file path could look like this:
`/Users/felix/sciebo/Teaching/SCICS/Lectures/07_Input_Output.ipynb`

On a Windows system, the file path would look different, e.g. like:
`C:\Users\felix\sciebo\Teaching\SCICS\Lectures\07_Input_Output.ipynb`

To handle file paths in different operating systems in a uniform way, the Python standard library contains the `pathlib` module. Within `pathlib`, the `Path` data type gives us a standardized way to operate with file paths, irrespective of the OS.

We can create a new object of the `Path` data type as `Path(path: str)`.

> 💡  In Google Colab, the absolute path of the data folder is `/content`.

In [None]:
from pathlib import Path

test_path = Path("/content/sample_data")

print(type(test_path))

A `Path` object can be used to interact with the operating system, e.g. to find out whether this path refers to a file or to a folder:
* `Path.is_file() -> bool` returns True if the Path object describes a file.
* `Path.is_dir() -> bool` returns True if the Path object describes a folder.

In [None]:
test_path.is_dir()

Moreover, `Path` objects allow us to access the path of parent folders, or files/folders within the current directory.
* `Path / child: str` can be used to get the path of files or folders within the current path.
* `Path.parent -> Path` returns a `Path` object of the parent folder.
* `Path.parents -> Sequence[Path]` returns a sequence of `Path` objects, one for each parent folder up to the root.

In [None]:
anscombe_file = test_path / "anscombe.json"

print(type(anscombe_file))
print(anscombe_file)
print(anscombe_file.is_file())

In [None]:
parent_path = test_path.parent

print(parent_path)
print(parent_path.is_dir())

For folders, `pathlib` provides us with some useful tools to loop over all contents of the folder. `for x in Path.iterdir()` loops over all files and folders in the directory.

In [None]:
for file in test_path.iterdir():
    print(file)

---
🧠  We can also narrow down what `pathlib` loops over. For example, we can loop over all files and folders whose names match a specific pattern, where `*` stands for any sequence of characters:
* `for x in Path.glob(pattern: str)` loops over all files and folders in the directory that match the pattern.
* `for x in Path.rglob(pattern: str)` loops over all files and folders in the directory and all of its sub-directories that match the pattern.

For example, the following code would loop over all .csv files that are in the `/content` directory or any of its sub-directories.
```
from pathlib import Path

our_path = Path("/content")

for file in our_path.rglob("*.csv"):
    print(file)
```

---

In principle, we can create a `Path` object of an arbitrary path in the operating system – this path does not need to exist in reality. We can use the `Path.exists() -> bool` method to find out if a path exists in the operating system or not.

If it does not exist, we can use `pathlib` to create a folder with this path, using the `Path.mkdir()` method.

In [None]:
our_path = Path("/content")

new_dir = our_path / "test_dir"
print(new_dir.exists())

In [None]:
new_dir.mkdir()
print(new_dir.exists())
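
Calling `mkdir()` on a folder that already exists, or inside a parent folder that does not exist yet, raises an error. The `parents` and `exist_ok` keyword arguments of `Path.mkdir()` can be used to avoid this. A small sketch in a temporary folder, with made-up folder names:

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())       # temporary folder for this demo
nested = base / "results" / "run_01"  # hypothetical nested folder

# parents=True also creates the missing "results" folder;
# exist_ok=True makes a repeated call harmless instead of raising an error.
nested.mkdir(parents=True, exist_ok=True)
nested.mkdir(parents=True, exist_ok=True)

print(nested.is_dir())
```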

Similarly, we can also remove (empty) directories using the `Path.rmdir()` method.

In [None]:
new_dir.rmdir()
print(new_dir.exists())

### Moving, Copying and Deleting Files

`pathlib` is mainly intended for handling file paths and directories, and offers only basic operations on the files themselves. If we want to move, copy, or delete files and folders, the standard library provides us with a dedicated module called "shell utilities", or `shutil` for short.

`shutil` does not provide us with new data types, but with a range of useful functions to handle files:
* `shutil.move(src: Path, dst: Path)` moves the file or folder at the source path to the destination path.
* `shutil.copy(src: Path, dst: Path)` copies the file at the source path to the destination path.
* `shutil.rmtree(path: Path)` removes the folder at the given path together with all of its contents. It does not work on single files; those can be deleted with `Path.unlink()`.

In [None]:
from pathlib import Path
import shutil

home_dir = Path("/content")

shutil.copy(home_dir / "sample_data" / "anscombe.json", home_dir / "test.json")

with open(home_dir / "test.json") as f:
    for line in f:
        print(line)

In [None]:
new_dir = home_dir / "test_dir"
new_dir.mkdir()

test_file = home_dir / "test.json"
shutil.move(test_file, new_dir / "test.json")

In [None]:
shutil.rmtree(new_dir)
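
For trying these functions outside of Google Colab, the same steps can be sketched in a temporary folder. All file and folder names below are made up, and `Path.write_text()` and `Path.unlink()` are `pathlib` methods that were not introduced above.

```python
import shutil
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())  # temporary folder for this demo
original = base / "data.txt"     # hypothetical file
original.write_text("1 2 3\n", encoding="utf-8")

backup_dir = base / "backup"
backup_dir.mkdir()

# Copy the file into the new folder, then move the original there as well.
copied = Path(shutil.copy(original, backup_dir / "data_copy.txt"))
shutil.move(str(original), str(backup_dir / "data.txt"))

names = sorted(p.name for p in backup_dir.iterdir())
print(names)

copied.unlink()            # deletes a single file
shutil.rmtree(backup_dir)  # deletes a folder with all its contents

print(backup_dir.exists())
```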

## Interactive Programs and Console Input

In principle, we can use Python to write small interactive programs, in which the user can provide some input through the command line, which can then be used by the Python program.

For this, Python contains the built-in `input(prompt: str) -> str` function, which waits for the user to provide some input. We can store this input in a new variable.

In [None]:
var = input("Please give me some input! ")

In [None]:
print(var)

In principle, we could use this to write a little – arguably very limited – chat bot.

In [None]:
inp = input("    Give me a hot take! \n")

while True:
    print(f"    Interesting... {inp} I disagree. Give me another hot take!")
    inp = input("")

Anything that comes through the console input will initially be interpreted as a string – but, as we have seen before, we can convert the input to any data type we want.
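
Since a user can type anything, such a conversion can fail. A minimal sketch of a defensive conversion is shown below; the helper `to_number` is a hypothetical name, and fixed strings stand in for what `input()` would return so that the example runs without user interaction.

```python
def to_number(raw):
    """Convert console input to a float, or return None if that fails.

    Hypothetical helper for illustration.
    """
    try:
        return float(raw.strip())
    except ValueError:
        return None

# In an interactive session, the string would come from input(), e.g.
#     raw = input("Temperature in °C: ")
# Here, fixed strings stand in for what a user might type.
for raw in ["21.5", " 300 ", "warm"]:
    value = to_number(raw)
    if value is None:
        print(f"{raw!r} is not a number")
    else:
        print(f"{raw!r} -> {value + 273.15:.2f} K")
```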