### 02 - Python Fundamentals Part 2

#### Outline

* File I/O
* Lists, Dictionaries, Loops, and Comprehensions


----

### File I/O

Saving and loading files describing inputs, configuration, and results is essential to reproducible and traceable analysis.

In [None]:
# These are part of the standard library
import os
from pathlib import Path

# These are libraries that provide "dataframes" and CSV I/O
import pandas as pd
import polars as pl

# These are libraries that provide some nested-structure I/O
import json
import yaml
import pydantic
from pydantic import BaseModel  # Good library

# SciPy provides MATLAB I/O
from scipy.io import savemat

working_directory = Path(os.path.abspath(''))  # Immediately stuff the string into a Path object
static_dir = working_directory / "static" / "02"

##### Raw Text I/O

Files are just a series of bytes.

Strings are stored as bytes too: an encoding such as UTF-8 maps each character to one or more bytes.

A common way to write files is using strings, which makes them human-inspectable. This is raw text: just some strings.

A "line" is a common pattern in files, but it's not required. It just makes things easier to handle.

Each line ends with a "newline" character `\n`, which is a kind of whitespace.
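
A small sketch of the relationship between text and bytes, assuming UTF-8 encoding. The example string is made up for illustration.

```python
# Text is stored on disk as bytes; an encoding maps characters to bytes
text = "naïve\n"            # 6 characters, including the newline
raw = text.encode("utf-8")  # Encode the characters into bytes

print(len(text))  # Number of characters
print(len(raw))   # Number of bytes; "ï" takes two bytes in UTF-8
print(raw)        # The newline shows up as the single byte b"\n"

# Decoding with the same encoding recovers the original string
assert raw.decode("utf-8") == text
```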

In [None]:
# Loading raw text
fpath: Path = static_dir / "example_text.txt"
with open(fpath, "r") as f:  # `with` closes the file automatically, even on error, so no file handles leak
    lines = f.readlines()
print(f"Example file contents:\n\n{lines}")  # Another use of format strings

In [None]:
# Saving raw text
fpath = static_dir / "example_text_output.txt"
lines_to_write = ["this\n", "is\n", "a\n", "file\n"]
with open(fpath, "w") as f:
    f.writelines(lines_to_write)
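
To check that writing and reading agree, here is a minimal round-trip sketch. It writes to a temporary directory instead of `static_dir`, and the file name `roundtrip.txt` is hypothetical. It also shows that `readlines` keeps the newline characters while `splitlines` drops them.

```python
import tempfile
from pathlib import Path

lines_to_write = ["this\n", "is\n", "a\n", "file\n"]

with tempfile.TemporaryDirectory() as tmp:
    fpath = Path(tmp) / "roundtrip.txt"  # Hypothetical file name

    # Passing the encoding explicitly avoids platform-dependent defaults
    fpath.write_text("".join(lines_to_write), encoding="utf-8")

    # `readlines` keeps the trailing "\n" on each line
    with open(fpath, "r", encoding="utf-8") as f:
        lines_raw = f.readlines()

    # `splitlines` drops the line endings
    lines_clean = fpath.read_text(encoding="utf-8").splitlines()

print(lines_raw)
print(lines_clean)
```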

##### CSV I/O

CSV (Comma Separated Value) is a common way to store tables of data that can be read by a human, program, or any spreadsheet software.

!! Warning !! If you open a CSV file in Excel and save it, Excel may round or reformat the numbers! You may only be able to tell that this has happened if the file is version-controlled!

In [None]:
# Saving CSV data
time_s = [1, 2, 3]
voltage_V = [3.1, 5.7, 1.2]

data = {"time [s]": time_s, "voltage [V]": voltage_V}

# With Pandas
df_pandas = pd.DataFrame(data)
df_pandas.to_csv(static_dir / "pandas.csv", index=False)

# With Polars
df_polars = pl.DataFrame(data)
df_polars.write_csv(static_dir / "polars.csv")

In [None]:
# Loading CSV data
import numpy as np  # This is the baseline numerics library

# With Pandas or Polars
df_pandas_new = pd.read_csv(static_dir / "polars.csv")  # Swapped files intentionally
df_polars_new = pl.read_csv(static_dir / "pandas.csv")

print(df_pandas_new["time [s]"])

assert np.all(df_pandas.values == df_pandas_new.values)
assert np.all(df_polars.to_numpy() == df_polars_new.to_numpy())

# With numpy
vals = np.genfromtxt(static_dir / "polars.csv", delimiter=",", skip_header=1)
assert np.all(vals == df_pandas.values)
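
When a dataframe library is not available, the standard library's `csv` module can do the same job with more manual work. This is a sketch under that assumption, not a replacement for the cells above. It writes to a temporary directory, and the file name `stdlib.csv` is hypothetical.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"time [s]": 1, "voltage [V]": 3.1},
    {"time [s]": 2, "voltage [V]": 5.7},
    {"time [s]": 3, "voltage [V]": 1.2},
]

with tempfile.TemporaryDirectory() as tmp:
    fpath = Path(tmp) / "stdlib.csv"  # Hypothetical file name

    # `newline=""` lets the csv module control line endings itself
    with open(fpath, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["time [s]", "voltage [V]"])
        writer.writeheader()
        writer.writerows(rows)

    with open(fpath, "r", newline="") as f:
        rows_loaded = list(csv.DictReader(f))

# Every value comes back as a string, so convert explicitly
voltages = [float(row["voltage [V]"]) for row in rows_loaded]
print(rows_loaded[0])
print(voltages)
```

Unlike the dataframe libraries, `csv` does not infer column types, which is why the conversion to `float` is spelled out.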

##### JSON I/O

JSON (JavaScript Object Notation) files are a common way to store data that looks like nested dictionaries or lists.

It is a productive way to encode and store configuration, settings, checkpoint data, and some results.

It is also commonly used as an inter-language communication medium (for example, to communicate between a client and server).

JSON itself does not limit numeric precision, but many parsers (JavaScript's included) store every number as a 64-bit float. A 64-bit float has a 53-bit significand, so integers above 2^53 are no longer all representable: beyond that point the spacing between adjacent floats exceeds 1.0.

In practice, this is rarely a problem. 53 is already a lot of bits - enough to represent individual nanoseconds for about 3 months, or individual seconds for 285 million years.
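
The 53-bit limit and the time spans quoted above can be checked with a few lines of arithmetic. This is a sketch; the variable names are made up for illustration.

```python
import json

limit = 2**53  # Up to this value, every integer fits exactly in a 64-bit float

# Just above 2**53, neighboring integers collapse onto the same float
assert float(limit) == float(limit + 1)
assert float(limit - 1) != float(limit)

# Python's own json module keeps integers exact; the loss happens
# only in parsers that store every number as a 64-bit float
assert json.loads(json.dumps(limit + 1)) == limit + 1

# The time spans: 2**53 nanoseconds in days, and 2**53 seconds in years
days_of_nanoseconds = limit / 1e9 / 86_400
years_of_seconds = limit / (365.25 * 86_400)
print(f"{days_of_nanoseconds:.0f} days of nanoseconds")
print(f"{years_of_seconds / 1e6:.0f} million years of seconds")
```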

In [None]:
# Remember this dictionary from before?
stuff: dict[str, str] = {
    "foo": "bar",
    "bar": "baz"
}

In [None]:
# We can save it as a json!
fpath = static_dir / "stuff.json"
with open(fpath, "w") as f:
    f.write(json.dumps(stuff, indent=4))

# ... and load it from json!
with open(fpath, "r") as f:
    stuff_reloaded: dict[str, str] = json.loads(f.read())

assert stuff == stuff_reloaded

# JSON is just a string! We can carry it around internally
print(json.dumps(stuff, indent=4))
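
As a variation, `json.dump` and `json.load` work directly on file objects, which skips the intermediate string. The sketch below uses a made-up `config` dictionary and a temporary directory. It also shows one round-trip surprise: JSON object keys are always strings.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical nested configuration, for illustration only
config = {
    "name": "run_01",
    "gains": [0.5, 1.5],
    "nested": {"enabled": True, "note": None},
}

with tempfile.TemporaryDirectory() as tmp:
    fpath = Path(tmp) / "config.json"

    with open(fpath, "w") as f:
        json.dump(config, f, indent=4)  # Write straight to the file

    with open(fpath, "r") as f:
        config_reloaded = json.load(f)  # Read straight from the file

assert config == config_reloaded

# Non-string keys are converted to strings on the way out
int_keys_reloaded = json.loads(json.dumps({1: "one"}))
print(int_keys_reloaded)
```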


In [None]:
# We don't have to handle json I/O by hand - we can define structures and let pydantic do it
import pydantic

# Define a "schema" which is the shape of the data
class MyStuff(pydantic.BaseModel):
    the_stuff: dict[str, str]
    the_number: float

    def myfunc(self) -> float:
        return self.the_number * 5.0

# Make a concrete instance of the data
my_stuff: MyStuff = MyStuff(the_stuff=stuff, the_number=3.14)

# Write to json string!
print(my_stuff.model_dump_json(indent=4))

# Load from json string!
json_string: str = my_stuff.model_dump_json(indent=4)
my_stuff_reloaded: MyStuff = MyStuff.model_validate_json(json_string)

# Refer to fields and methods
print(my_stuff_reloaded.myfunc())

assert my_stuff == my_stuff_reloaded


##### YAML I/O

YAML (YAML Ain't Markup Language) is similar to JSON, but it is often easier to read, and it can represent some values that standard JSON cannot, such as infinity and NaN. It also includes some templating features that we won't discuss here.

In [None]:
import yaml

# We can convert a dictionary to a yaml just like json
yaml_string: str = yaml.safe_dump(stuff)
print(yaml_string)

In [None]:
# We can also load a json and swap to yaml
data_loaded_from_json: dict[str, str] = json.loads(json_string)
yaml_from_json: str = yaml.safe_dump(data_loaded_from_json)
print(yaml_from_json)

In [None]:
# ... or load yaml and swap to json
data_loaded_from_yaml: dict[str, str] = yaml.safe_load(yaml_from_json)
json_from_yaml: str = json.dumps(data_loaded_from_yaml, indent=4)

assert data_loaded_from_json == data_loaded_from_yaml

print(json_from_yaml)

##### TOML I/O

TOML (Tom’s Obvious Minimal Language) is similar to JSON and YAML. It is more commonly used for configuring tools, especially low-level utilities, because it is simpler than the others and, as a result, easy to write a parser for.

In [None]:
import toml  # Pretty good library

my_stuff_dict = json.loads(my_stuff.model_dump_json())
toml_string = toml.dumps(my_stuff_dict)
print(toml_string)

##### MATLAB I/O

Sometimes you don't have time to rewrite all that MATLAB.

MATLAB can call Python functions, and directly take ownership of primitives returned from those functions.

For arrays and matrices, Python can write `.mat` files that MATLAB can read.

`.mat` files store nested data structures, optionally compressed. Only the v7.3 format is HDF5-based; `scipy.io.savemat` writes the older v5 format (or v4 on request), which does not depend on HDF5. The HDF5 system is incredibly unreliable due to DLL naming overlap and should be avoided when possible - use `zarr` instead!

In [None]:
from scipy.io import savemat
from scipy import sparse

dense_array: np.typing.NDArray = np.array([1.0, 3.0, 2.0])
dense_mat: np.typing.NDArray = np.diag(dense_array)
sparse_mat: sparse.csc_matrix = sparse.csc_matrix(dense_mat)  # Not the most efficient way to do this

# Pack stuff to put in the file
mat_data = {
    "dense_array": dense_array,
    "dense_mat": dense_mat,
    "sparse_mat": sparse_mat,
    "some_number": 3.14
}

savemat(static_dir / "mat_from_python.mat", mat_data)

### Lists, Dictionaries, and Comprehensions

Types that represent a pile of stuff are called "collections."

Different collections have different benefits and drawbacks.

Some collections can be built using shorthand notation called "comprehension," which is essentially Set-Builder Notation written out in words.
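
As a small illustration of the set-builder analogy, the set { x² | x ∈ {0, ..., 9}, x even } translates almost word for word. The variable names here are made up.

```python
# Set-builder notation: { x^2 | x in {0, ..., 9}, x is even }
squares_of_evens = {x**2 for x in range(10) if x % 2 == 0}

# The same pattern builds a list (ordered, duplicates allowed) ...
squares_list = [x**2 for x in range(10) if x % 2 == 0]

# ... or a dictionary mapping each x to its square
squares_map = {x: x**2 for x in range(10) if x % 2 == 0}

print(squares_list)
print(squares_map)
```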

Some collections are technically faster to access than others. This almost never matters in Python, so don't worry about it unless your stuff is so slow that it doesn't work.

| Type | Access | Pros and Cons | Comprehension Pattern |
|------|--------|---------------|-----------------------|
| List[T] | O(1), ordered, fast constant term | Fast to access, slow to add elements; index by numbers | `[x for x in y]` |
| Dict[K, V] | O(1), ordered, slow constant term | Medium to access, medium to add elements; index by anything | `{key: value for key, value in y}` |
| Set[T] | One does not access. O(1) to check if a value is in the set | Guarantees uniqueness of members | `{x for x in y}` |
| numpy NDArray[T] | O(1), fast iterators | Like a list, but it can do math | `np.array([x for x in y])` |

There are many more, but these are enough for most uses.

Compared to MATLAB,
* Python indexes from zero, so the first element is `mylist[0]`, not `mylist[1]`
* Python uses square brackets for indexing (to distinguish from function calls, which use parentheses)
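
A short sketch of the indexing differences, with the rough MATLAB equivalent in the comments. The list `samples` is made up for illustration.

```python
samples = [10, 20, 30, 40]

first = samples[0]    # First element (MATLAB: samples(1))
last = samples[-1]    # Last element (MATLAB: samples(end))
head = samples[0:2]   # Elements 0 and 1; the stop index is excluded
count = len(samples)  # Valid indices run from 0 to count - 1

print(first, last, head, count)
```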

##### Lists

In [None]:
# List assembly

mylist = [1, 2, 3]  # Direct
mylist_2 = [x + 1 for x in range(3)]  # Iterator and comprehension
mylist_3 = []  # Incremental; usually slower than a comprehension for large lists
for i in range(3):  # `range` produces integers lazily
    mylist_3.append(i + 1)  # The list is occasionally re-allocated as it grows

print(mylist)
print(mylist_2)
print(mylist_3)

# You can't access an index before it exists,
# so you can't assemble a list this way
try:
    mylist_4 = []
    for i in range(3):
        mylist_4[i] = i + 1  # Breaks because list[i] does not exist
except IndexError:  # Catch only the expected error instead of everything
    print("Noop")
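
If assignment by index is what you want, one option is to preallocate the list with placeholders first. A minimal sketch, with made-up names:

```python
n = 3

# Preallocate a list of placeholders, then fill it by index
prealloc = [None] * n
for i in range(n):
    prealloc[i] = i + 1

print(prealloc)

# Assigning past the end still fails, because that index does not exist
try:
    prealloc[n] = 99
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```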

In [None]:
# List access

mylist[0]  # Direct indexing

print("Direct iteration on list items")
for x in mylist:  # As an iterator
    print(x)

print("\nComprehension over an existing list to make a new list")
print([x * 5 for x in mylist])

print("\nDirect indexing in a range loop")
for i in range(len(mylist)):
    print(f"At index {i}, we have {mylist[i]}")

# Slicing
print("\nSlice access to list")
print(mylist[1:]) # Values after and including element 1
print(mylist[:1]) # Values before element 1
print(mylist[:-1]) # Values up to one before the end
print(mylist[1:-1])  # Values in the middle minus the ends
print(mylist[::-1])  # Values in reverse
print(mylist[::2])  # Every second element

In [None]:
# Sorting with the builtin function `sorted`

print(sorted(mylist))
print(sorted(mylist[::-1]))  # Reverse then sort

# Sorting can also be done with a custom comparison key
import numpy as np

rand1 = np.random.uniform(0, 1, 10).reshape((5, 2)).tolist()  # List of random pairs
print("\nUnsorted:                       ", rand1)
print("Sorted by first element of pair:", sorted(rand1, key = lambda x: x[0]))  # Sort by the first element of each pair
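
A deterministic sketch of the same ideas, using a fixed list of made-up pairs instead of random numbers. It also contrasts `sorted`, which returns a new list, with `list.sort`, which sorts in place.

```python
pairs = [(3, "c"), (1, "a"), (2, "b")]

# Tuples compare element by element, so this sorts by the number
by_number = sorted(pairs)

# Sort by the letter instead, in descending order
by_letter_desc = sorted(pairs, key=lambda p: p[1], reverse=True)

# `sorted` leaves its input untouched
assert pairs == [(3, "c"), (1, "a"), (2, "b")]

# `list.sort` modifies the list in place and returns None
pairs.sort()

print(by_number)
print(by_letter_desc)
print(pairs)
```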

##### Dictionaries

In [None]:
# Dictionary assembly
# Dictionaries are also called maps (hashmaps in this case), which, uh, helps for presenting
keys = ["a", "c", "b"]
values = [1, 3, 2]

mymap = {"a": 1, "c": 3, "b": 2}  # Direct notation; ordered result
mymap_2 = dict(a=1, c=3, b=2)  # Function initializer; also insertion-ordered (Python 3.7+)
mymap_3 = {k: v for k, v in zip(keys, values)}  # `zip` makes an iterator over pairs of values

# For dictionaries, we can assign directly to an entry that doesn't exist yet
mymap_4 = {}
for i in range(len(keys)):
    k = keys[i]
    v = values[i]
    mymap_4[k] = v
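
The loop above can also be written in one line by passing `zip` straight to `dict`. The sketch below uses the hypothetical names `mymap_5` and `mymap_sorted`, and assumes Python 3.7 or newer for the ordering guarantee.

```python
keys = ["a", "c", "b"]
values = [1, 3, 2]

# `dict` accepts any iterable of (key, value) pairs
mymap_5 = dict(zip(keys, values))

# Insertion order is preserved when iterating
assert list(mymap_5) == ["a", "c", "b"]

# Equality between dictionaries ignores order
assert mymap_5 == {"a": 1, "b": 2, "c": 3}

# A copy sorted by key
mymap_sorted = dict(sorted(mymap_5.items()))
print(mymap_sorted)
```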


In [None]:
# Dictionary access

one = mymap["a"]  # Direct access
items = [(k, v) for k, v in mymap.items()]  # As an iterator in a comprehension

# In a loop over the keys, values, or both
print("\nKey iterator")
for k in mymap.keys():
    print(k)

print("\nValue iterator")
for v in mymap.values():
    print(v)

print("\nItems -> both keys and values")
for k, v in mymap.items():
    print(k, v)

# Accessing a member of a dictionary that doesn't exist is an error
print("\nAccessing non-existent item")
try:
    mymap["wassup"]
except KeyError:  # Catch only the expected error instead of everything
    print("noop")

# We can use a fallible method for access if we're not sure whether the item is there or not
print("\nFallible access to non-existent member")
k = "wassup"
val_not_here: int | None = mymap.get(k)
print(f" The value at `{k}` is missing:", val_not_here)
val_is_here: int | None = mymap.get("a")
print(" The value at `a` is present:", val_is_here)
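
Two related patterns are sketched below with made-up names: testing membership with `in`, and passing a default value to `get` so that a missing key yields something other than `None`.

```python
mymap = {"a": 1, "c": 3, "b": 2}

# Membership test on the keys
has_a = "a" in mymap
has_wassup = "wassup" in mymap

# `get` takes an optional default that is returned for missing keys
fallback = mymap.get("wassup", 0)

# A common use: counting occurrences without checking for the key first
counts = {}
for letter in "abca":
    counts[letter] = counts.get(letter, 0) + 1

print(has_a, has_wassup, fallback)
print(counts)
```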


##### Sets

In [None]:
# Sets are mostly useful for assembling unique members

mylist_repetitive = [2, 1, 1, 1, 3, 3]
list(set(mylist_repetitive))  # Note this is not in the same order!
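
When the original order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each value in order (assuming Python 3.7 or newer). The sketch also shows the basic set operations; the names `a` and `b` are made up.

```python
mylist_repetitive = [2, 1, 1, 1, 3, 3]

# Unique values in order of first appearance
unique_in_order = list(dict.fromkeys(mylist_repetitive))

# Unique values in sorted order
unique_sorted = sorted(set(mylist_repetitive))

# Sets also support the usual set algebra
a = {1, 2, 3}
b = {3, 4}
union = a | b
intersection = a & b
difference = a - b

print(unique_in_order, unique_sorted)
print(union, intersection, difference)
```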

##### Arrays

In [None]:
# Arrays are similar to MATLAB - lists or matrices of values that can do math
# BUT multiplication is elementwise by default, and when array shapes are incompatible (cannot be broadcast together), an error is produced.

arr1 = np.array([x for x in range(5)]) # 5 numbers from 0 to 4
arr2 = np.random.uniform(0, 1, 5)  # 5 random numbers between 0 and 1

# Array mult ----------------- Matrix mult
arr1 * arr2       == np.diag(arr1) @ np.atleast_1d(arr2)
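
A sketch of what happens when shapes differ, using made-up arrays. Compatible shapes are stretched to match ("broadcasting"), incompatible shapes raise a `ValueError`, and `@` performs matrix or dot products.

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0])   # Shape (3,)
col = np.array([[10.0], [20.0]])  # Shape (2, 1)

# Shapes (2, 1) and (3,) are compatible: the result has shape (2, 3)
grid = col * row
print(grid)

# Shapes (2,) and (3,) are not compatible, so this raises an error
try:
    np.array([1.0, 2.0]) * row
except ValueError as err:
    mismatch_message = str(err)
    print("ValueError:", mismatch_message)

# `@` on two 1-D arrays is the dot product
dot = row @ row
print(dot)
```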