# Operating system and files

## Before we start... `import`
To interact with the operating system and read files, we need to *import a module*:
```python
import os
```

- What is a **module**? A file containing `python` code (variables, function definitions, classes).
- What does it mean to **import a module**? The first time we import a module, `python` runs all the code in the module, as simple as that.
- Most of the time, a module contains only *definitions* and is not supposed to execute any function when imported.
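
To see that importing really does run the module's code, here is a minimal sketch that writes a throwaway module to a temporary directory and imports it. The module name `demo_module` and its contents are hypothetical, made up for this illustration only.

```python
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary directory.
# `demo_module` is a hypothetical name used only for this example.
module_dir = tempfile.mkdtemp()
Path(module_dir, "demo_module.py").write_text(
    "print('demo_module is being executed')\n"
    "ANSWER = 42\n"
    "def double(x):\n"
    "    return 2 * x\n"
)

# Make the directory visible to the import system, then import the module.
sys.path.insert(0, module_dir)
import demo_module  # the print() inside the module runs at this point

# The definitions are now available through the module's namespace.
print(demo_module.ANSWER)
print(demo_module.double(21))
```

Importing `demo_module` a second time in the same session would not print the message again, because `python` keeps already imported modules in `sys.modules`.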


In [6]:
import os
print(os.name)

posix


## Paths
Paths identify a file on a filesystem.

Examples:
- Windows path: `C:\User\Documents\file.ext`
- Linux path: `/home/user/file`.

In Windows, most filenames have an extension, and the extension is how the OS determines the file type.
In Linux, file type and extension are unrelated at a fundamental level, but extensions are still helpful to users and applications.

While it is technically possible to manipulate paths as strings, **don't do it**. It's messy, ugly and does not play well across different operating systems!
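
As a small illustration of the portability problem, the sketch below builds the same relative path three ways. It uses the "pure" path classes from `pathlib`, which only manipulate path text and never touch the filesystem, so it behaves the same on any OS. The directory and file names are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical path components, for illustration only
parts = ["data", "run1", "file.ext"]

# String manipulation hard-codes one separator
naive = "/".join(parts)

# Path objects pick the separator that belongs to their flavour
posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

print(naive)
print(posix_path)
print(windows_path)
```

The string version is only right on systems that use `/`, while the path objects each render with their own separator.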

### Python paths, the old way


In [7]:
# Let's find out our current directory
base_dir = os.getcwd()
print(base_dir)

/home/lincetto/Work/py4phys-2022/notebooks


Now let's define a new directory...

In [8]:
new_dir = os.path.join(base_dir, 'work')
print(new_dir)

/home/lincetto/Work/py4phys-2022/notebooks/work


We have our path, let's create it!

In [9]:
# if we try to create a directory with an existing name, we get an error
if not os.path.exists(new_dir):
    os.makedirs(new_dir)

print(type(new_dir))

<class 'str'>
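
The existence check can also be delegated to `os.makedirs` itself through its `exist_ok` argument. The following is a minimal sketch that works in a temporary scratch directory (an assumption of this example, so that nothing is created in the notebook folder):

```python
import os
import tempfile

# Work in a temporary scratch directory (assumption for this example)
scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "work", "nested")

# Creates intermediate directories too; no error if the path already exists
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)

# Without exist_ok, creating an existing directory raises FileExistsError
try:
    os.makedirs(target)
    raised = False
except FileExistsError:
    raised = True

print(os.path.isdir(target), raised)
```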


### Python paths, the cool way
We have noticed that, after all, we are still manipulating a path as a string. Can we do better?

In [10]:
from pathlib import Path

In [11]:
base_dir = Path(base_dir)
print(base_dir)
type(base_dir)

/home/lincetto/Work/py4phys-2022/notebooks


pathlib.PosixPath

In [12]:
new_dir = base_dir / "work"

new_dir.mkdir(exist_ok=True)

- We can manipulate paths as objects.
- Better functionality in the form of class and instance methods of `Path`.
- Awesome `/` operator! 
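
A few of the attributes and methods that path objects offer are sketched below. The example uses `PurePosixPath` and a made-up path so that the result does not depend on the machine it runs on; a regular `Path` exposes the same attributes.

```python
from pathlib import PurePosixPath

# Hypothetical path, used only to show the attributes
p = PurePosixPath("/home/user/work/datafile.dat")

print(p.name)                  # last component, extension included
print(p.stem)                  # last component without the extension
print(p.suffix)                # the extension, dot included
print(p.parent)                # the containing directory
print(p.parts)                 # all components as a tuple
print(p.with_suffix(".json"))  # same path with another extension
```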

## First file: a text file

In [21]:
# first, some data
names = ["NGC 5128", "TXS 0506+056", "NGC 1068", "GB6 J1040+0617", "TXS 2226-184"]
distances = [3.7, 1.75e3, 14.4, 1.51e4, 107.1]  # Mpc
luminosities = [1e40, 3e46, 4.9e38, 6.2e45, 5.5e41] # erg/s

dataset = { 'names' : names, 'distances' : distances, 'luminosities' : luminosities }

In [16]:
filename = 'datafile.dat'

filepath = new_dir / filename

with open(filepath, 'w') as f:
    for string in names:
        f.write(string + '\n')

# no need to explicitly close the file!

In [17]:
with open(filepath, 'r') as f:
    data = f.read()

print(data)

NGC 5128
TXS 0506+056
NGC 1068
GB6 J1040+0617
TXS 2226-184



If the file is really big, this is not ideal because the whole file content is loaded into a variable (in RAM). It is better to read line by line:

In [18]:
with open(filepath, 'r') as f:
    for line in f:
        print(line)

# you can also use f.readline() to read one line at a time

NGC 5128

TXS 0506+056

NGC 1068

GB6 J1040+0617

TXS 2226-184
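
The blank lines in the output above appear because each line read from the file keeps its trailing `'\n'`, and `print` adds a second one. A minimal sketch of two common ways around this, using a small temporary file (an assumption of this example) instead of `datafile.dat`:

```python
import tempfile
from pathlib import Path

# Small temporary file standing in for the real data file
demo_path = Path(tempfile.mkdtemp()) / "names.txt"
demo_path.write_text("NGC 5128\nTXS 0506+056\n")

cleaned = []
with open(demo_path, "r") as f:
    for line in f:
        # Option 1: tell print not to add its own newline
        print(line, end="")
        # Option 2: strip the trailing newline before using the string
        cleaned.append(line.rstrip("\n"))

print(cleaned)
```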



By default, the file is opened in text mode, which means that:
- only strings can be written;
- everything is read back as a string.
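
In practice this means that numbers have to be converted to strings when writing and converted back when reading. A minimal sketch, using a temporary file and a shortened list of distances (both assumptions of this example):

```python
import tempfile
from pathlib import Path

demo_path = Path(tempfile.mkdtemp()) / "distances.txt"
distances = [3.7, 1.75e3, 14.4]  # Mpc

with open(demo_path, "w") as f:
    for d in distances:
        # f.write(d) would raise a TypeError: write() only accepts strings
        f.write(str(d) + "\n")

with open(demo_path, "r") as f:
    raw = [line.strip() for line in f]  # a list of strings

# Explicit conversion back to numbers
restored = [float(s) for s in raw]

print(raw)
print(restored)
```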

## Binary files
- Writing binary content by hand is complicated and messy.
- In `python` we can use `pickle` to dump an arbitrary object into a file.

In [23]:
import pickle

with open(filepath, 'wb') as f:
    pickle.dump(dataset, f)


In [25]:
with open(filepath, 'rb') as f:
    obj = pickle.load(f)

print(obj)

{'names': ['NGC 5128', 'TXS 0506+056', 'NGC 1068', 'GB6 J1040+0617', 'TXS 2226-184'], 'distances': [3.7, 1750.0, 14.4, 15100.0, 107.1], 'luminosities': [1e+40, 3e+46, 4.9e+38, 6.2e+45, 5.5e+41]}


It works with basically any object (even your own classes), but it is also very opaque:
- `python` specific, no cross-language standard;
- basically you need to know in advance what's inside the file;
- writing and reading iteratively is possible but complicated.
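
To give an idea of what iterative writing and reading looks like, here is one possible sketch: several objects are dumped one after another into the same file, then loaded back until `pickle.load` raises `EOFError` at the end of the file. The file name and the records are made up for this example.

```python
import pickle
import tempfile
from pathlib import Path

demo_path = Path(tempfile.mkdtemp()) / "records.pkl"
records = [("NGC 5128", 3.7), ("NGC 1068", 14.4)]  # (name, distance in Mpc)

# Each dump() call appends one pickled object to the open file
with open(demo_path, "wb") as f:
    for record in records:
        pickle.dump(record, f)

# Each load() call reads back one object; EOFError marks the end of the file
loaded = []
with open(demo_path, "rb") as f:
    while True:
        try:
            loaded.append(pickle.load(f))
        except EOFError:
            break

print(loaded)
```

Note that the reader has to know that the file holds a sequence of separate objects, which is exactly the kind of prior knowledge mentioned above.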

## The magic of JSON
- JSON (JavaScript Object Notation) is a standard encoding format that allows multiple data types to be written to a text file.
- You can think of a JSON file as a big nested dictionary.
- Most `python` native data types can be written as a JSON file. 

In [29]:
import json

with open(filepath, 'w') as f:
    json_data = json.dumps(dataset) # dumps() returns a string
    json.dump(dataset, f) # dump() writes to file!

In [28]:
print(json_data)
type(json_data)

{"names": ["NGC 5128", "TXS 0506+056", "NGC 1068", "GB6 J1040+0617", "TXS 2226-184"], "distances": [3.7, 1750.0, 14.4, 15100.0, 107.1], "luminosities": [1e+40, 3e+46, 4.9e+38, 6.2e+45, 5.5e+41]}


str

- It looks like `python` syntax, but this is JSON.
- The file is human-readable!

In [31]:
with open(filepath, 'r') as f:
    obj = json.load(f)

print(obj)
type(obj) # original type is restored!

{'names': ['NGC 5128', 'TXS 0506+056', 'NGC 1068', 'GB6 J1040+0617', 'TXS 2226-184'], 'distances': [3.7, 1750.0, 14.4, 15100.0, 107.1], 'luminosities': [1e+40, 3e+46, 4.9e+38, 6.2e+45, 5.5e+41]}


dict
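
The round trip gives back a dictionary of lists, strings and floats, but not every `python` type survives unchanged, because JSON has fewer types. A short sketch with a made-up dictionary shows two common surprises: tuples come back as lists, and non-string keys come back as strings.

```python
import json

# Hypothetical data, chosen to show the type conversions
original = {"position": (1.5, 2.5), 1: "integer key", "flag": True, "missing": None}

text = json.dumps(original)
restored = json.loads(text)

print(text)
print(restored)
print(type(restored["position"]))
```

`True` and `None` are written as the JSON values `true` and `null` and are restored correctly.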

## CSV
CSV stands for "comma-separated values"; it is the format of choice for tabular data. A CSV file consists of lines (entries) in which the different values (fields) are separated by commas.

In [32]:
import csv

In [38]:
with open(filepath, 'w', newline='') as f:  # newline='' is recommended by the csv docs
    writer = csv.DictWriter(f, fieldnames=["name","distance", "luminosity"])
    for name, dist, lum in zip(names, distances, luminosities):
        writer.writerow({"name": name, "distance": dist, "luminosity": lum})

In [37]:
with open(filepath, 'r', newline='') as f:  # newline='' is recommended by the csv docs
    reader = csv.DictReader(f, fieldnames=["name","distance", "luminosity"])
    for row in reader:
        print(row)

{'name': 'NGC 5128', 'distance': '3.7', 'luminosity': '1e+40'}
{'name': 'TXS 0506+056', 'distance': '1750.0', 'luminosity': '3e+46'}
{'name': 'NGC 1068', 'distance': '14.4', 'luminosity': '4.9e+38'}
{'name': 'GB6 J1040+0617', 'distance': '15100.0', 'luminosity': '6.2e+45'}
{'name': 'TXS 2226-184', 'distance': '107.1', 'luminosity': '5.5e+41'}
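
As the output shows, every field is read back as a string. The sketch below is one possible variation: it writes a header row with `writeheader()`, so that `DictReader` can take the field names from the file itself, and converts the numeric fields explicitly. It uses an in-memory `io.StringIO` buffer instead of a file on disk and a shortened dataset, both assumptions made to keep the example self-contained.

```python
import csv
import io

names = ["NGC 5128", "NGC 1068"]
distances = [3.7, 14.4]          # Mpc
luminosities = [1e40, 4.9e38]    # erg/s

# In-memory text buffer standing in for a file opened with newline=''
buffer = io.StringIO(newline="")

writer = csv.DictWriter(buffer, fieldnames=["name", "distance", "luminosity"])
writer.writeheader()  # first line of the file holds the field names
for name, dist, lum in zip(names, distances, luminosities):
    writer.writerow({"name": name, "distance": dist, "luminosity": lum})

# Rewind and read back; without `fieldnames`, the first row is used as header
buffer.seek(0)
rows = []
for row in csv.DictReader(buffer):
    row["distance"] = float(row["distance"])      # strings -> numbers
    row["luminosity"] = float(row["luminosity"])
    rows.append(row)

print(rows)
```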
