# Python for SysAdmins – Interaction with the file system (1)

**just in case...**

* ... you wonder where you can find this script: it is available on [GitHub](https://github.com/eth-its/Python-for-SysAdmins/tree/main/ws2)
* ... you have a mess with your Python installation: [Python Best Practices](https://gitlab.ethz.ch/vermeul/python-best-practices) might help

## Modules

### The standard library

Python comes with a lot of pre-installed modules (standard Python library) which greatly extend the language.

Visit the [Python Module of the Week](https://pymotw.com/3/) website to get a good overview. All these modules are directly shipped with Python, hence «batteries included».

For dealing with files and directories, we are going to look into a few modules of the standard library:

- [`os`](https://docs.python.org/3/library/os.html) – operating system interactions
- [`sys`](https://docs.python.org/3/library/sys.html) – Python runtime environment manipulation
- [`pathlib`](https://docs.python.org/3/library/pathlib.html) – object-oriented filesystem paths
- [`shutil`](https://docs.python.org/3/library/shutil.html) – high-level file operations
- [`re`](https://docs.python.org/3/library/re.html) – regular expressions
- [`json`](https://docs.python.org/3/library/json.html) – JSON encoder and decoder
- [`csv`](https://docs.python.org/3/library/csv.html) – CSV files, reading and writing

### more modules! The Python Package Index (PyPI)

Python already comes with pre-installed modules, but there is much more! The Python Package Index (https://pypi.org) hosts thousands of additional packages that cover most everyday problems. To install them, use the `pip` command line tool, which ships with Python.

Within Jupyter, you can put an exclamation mark `!` at the beginning of a line in a code cell to run that line as a shell command. **Example:**

In [None]:
!pip install pandas

The following does the same, but uses the `-m` option to tell Python to run a specific module (here `pip`) as a script:

In [None]:
!python3 -m pip install pandas
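In a notebook, a bare `pip` may belong to a different Python installation than the kernel that runs your cells. One way to rule that out is to call `pip` through `sys.executable` with the `-m` option. This is a minimal sketch, assuming `pip` is installed for the running interpreter. It only asks `pip` for its version, so nothing gets installed.

```python
import subprocess
import sys

# sys.executable is the interpreter running this code, so "-m pip"
# addresses the pip that belongs to exactly this Python installation
result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print(result.stdout.strip())
else:
    print("pip is not available for", sys.executable)
```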

### list of installed modules

Sometimes you need to know which packages you've installed so far, and which versions you used.
If you distribute your script, you want to record these dependencies in a configuration file:
* `pyproject.toml` (modern approach)
* `pyproject.toml` + `setup.cfg` (almost modern approach)
* `pyproject.toml` + `setup.py` (conservative approach)
* `requirements.txt` (old school / data science)

More info: https://setuptools.pypa.io/en/latest/userguide/quickstart.html

Modern approach:

`pyproject.toml`
```toml
[project]
name = "mypackage"
version = "0.0.1"
dependencies = [
    "requests",
    'importlib-metadata; python_version<"3.8"',
    "click>=8.0"
]
```

Reproduce setup:
```sh
pip install .
```

Old school / data science:

In [None]:
!pip freeze > requirements.txt

Later, other people can install exactly the same packages in exactly the same versions, like this:

```sh
pip install -r requirements.txt
```
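The installed packages and their versions can also be read from within Python. The following sketch uses `importlib.metadata` from the standard library (Python 3.8 or newer). It prints lines that look like the output of `pip freeze`, but it is not a drop-in replacement: `pip freeze` hides a few packaging tools by default and treats editable installs differently.

```python
from importlib.metadata import distributions

installed = {}
for dist in distributions():
    name = dist.metadata.get("Name")
    if name:  # skip distributions whose metadata is incomplete
        installed[name] = dist.version

# print one "name==version" line per installed distribution
for name in sorted(installed, key=str.lower):
    print(f"{name}=={installed[name]}")
```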

### How modules are imported

At the beginning of most Python files, you will see a list of `import` statements.

With the `import` statement we tell Python to look for a module and to bind it to a name, just like a variable.

The **order** in which Python looks for modules is as follows:

1. look in the directory of the script being run (or in the current working directory in an interactive session)
2. look in the paths specified by the `PYTHONPATH` environment variable, if this variable exists
3. look in the standard library path (`lib/python3.x/`)
4. look in the path where all external modules, including those from [pypi.org](https://pypi.org), are installed (usually in `lib/python3.x/site-packages`)

In [None]:
import os
print("the variable 'os' contains a module:", os)

In [None]:
import click  # a module installed from pypi.org
print("'click' can be found here: ", click)

`sys.path` tells us where Python is looking for modules

In [None]:
import sys
sys.path
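To check which file an `import` would resolve to without actually importing it, `importlib.util.find_spec` can be used. This is a small sketch. The name `a_module_that_does_not_exist` is a made-up placeholder to show what happens when nothing is found.

```python
import importlib.util

for name in ("json", "pathlib", "a_module_that_does_not_exist"):
    # find_spec searches along sys.path and returns None if nothing matches
    spec = importlib.util.find_spec(name)
    if spec is None:
        print(f"{name}: not found on sys.path")
    else:
        print(f"{name}: {spec.origin}")
```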

### Write your own modules: the infamous `__init__.py` file

There is a Python file [my_hello_world.py](my_module/my_hello_world.py) inside the subdirectory [my_module](my_module).
We can tell Python to refer to that file inside that folder by using the `from <folder> import <module>` syntax:

In [None]:
from my_module import my_hello_world
my_hello_world.say_hello("World!")

If you have many functions in separate `.py` files, importing them one by one gets cumbersome. It is easier to bundle them into a package and present that to the user. That's where the `__init__.py` file becomes important. With this file, you can treat the whole directory `my_module` as if it were a single Python file:

In [None]:
import my_module   # this loads my_module/__init__.py
my_module.my_hello_world.say_hello("World!")

Inside `my_module/__init__.py` I added these lines:

```python
from .my_hello_world import say_hello
from .my_upper_hello_world import say_hello_upper
```

With these lines, I can «publish» certain functions directly, as they are directly attached to the module:

In [None]:
import my_module               # this loads my_module/__init__.py
my_module.say_hello("World!")
my_module.say_hello_upper("What", "a", "beautiful", "world!")
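The same mechanism can be tried without the workshop files. The sketch below builds a throwaway package in a temporary directory. The names `demo_pkg`, `greetings` and this version of `say_hello` are hypothetical and are not the contents of `my_module`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    # a submodule that defines one function
    (pkg / "greetings.py").write_text(
        "def say_hello(name):\n    return f'Hello, {name}'\n"
    )
    # __init__.py re-exports the function at package level
    (pkg / "__init__.py").write_text("from .greetings import say_hello\n")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        demo_pkg = importlib.import_module("demo_pkg")
        greeting = demo_pkg.say_hello("World!")
    finally:
        sys.path.remove(tmp)

print(greeting)  # Hello, World!
```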

### manipulate `PYTHONPATH` during runtime

Because the interpreter has already started, setting the `PYTHONPATH` environment variable no longer has any effect, but we can change the content of `sys.path`:

In [None]:
import sys
# put "my_module" in the front of everything else
sys.path.insert(0, "my_module")

now we can import the module directly:

In [None]:
import my_hello_world
my_hello_world.say_hello("This", "is", "Python!")

Often, the module name is rather long to type, so we give it an **alias**:

In [None]:
import my_hello_world as mhw
mhw.say_hello("this", "works", "too!")

### A few Jupyter tricks

**put a question mark ? directly after any function, method or module name** and execute the cell to display the so-called _docstring_.

In [None]:
import os
os?

It becomes especially handy if you can't remember the parameters that you need to provide:

In [None]:
print?

Even more handy is to hit `shift-TAB` when the cursor is inside the parentheses of a function or method call:

In [None]:
print()


**use Jupyter’s TAB completion to list all methods**

Type the following into a cell, then press the TAB key after the dot: the available functions and attributes appear as a vertical list.

In [None]:
os.path.

**Use your keyboard to navigate**

Jupyter has an edit mode and a command mode, similar to the insert and normal modes of vim. Hit the `Escape` key to enter command mode, then use:

- the `K` and `J` keys to go up and down
- `d d` to delete a cell
- `z` to undo a deletion
- `b` to insert a cell below
- `a` to insert a cell above

Hit `Enter` to switch back to edit mode.

## The `sys` module

This module provides access to a lot of information about the Python interpreter itself.

In [None]:
sys.version

In [None]:
sys.version_info

A typical example of how to prevent a script from being executed with the **wrong Python interpreter**:

In [None]:
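# sys.version_info compares like a tuple;
# sys.exit() with a string prints it to stderr and exits with status 1
if sys.version_info < (3, 7):
    sys.exit('Sorry, Python < 3.7 is no longer maintained')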

In [None]:
sys.executable

**Add the locally installed packages to the module search path**

In [None]:
sys.path

In [None]:
import pandas  # this fails if pandas is only installed in user-space

By adding the path of our locally installed packages to `sys.path`, we can make it run: 

In [None]:
import os
import sys
# os.path.expanduser expands ~ to the user's home directory (e.g. /home/user_x);
# note that the Python version (3.10) is hard-coded in this path
sys.path.append(os.path.expanduser('~/.local/lib/python3.10/site-packages'))

Now we can test whether the import works:

In [None]:
import pandas
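The path used above contains a hard-coded Python version, so it breaks as soon as the interpreter is upgraded. As a sketch of a more portable variant, the standard `site` module can report the user site-packages directory of the running interpreter:

```python
import site
import sys

# user site-packages directory of the interpreter that runs this code
user_site = site.getusersitepackages()
print("user site-packages:", user_site)
# True if Python adds this directory to sys.path on its own at startup
print("enabled by default:", site.ENABLE_USER_SITE)

if user_site not in sys.path:
    sys.path.append(user_site)
```

If `site.ENABLE_USER_SITE` is `True`, the directory is normally on `sys.path` already once it exists, and the `append` is not needed.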

## The `os` module

### Environment variables

In [None]:
os.environ

`os.environ` is a dictionary-like mapping of the environment variables. To safely fetch an item (without raising a `KeyError`), we use its `.get(key, default)` method:

In [None]:
os.environ['PYTHONPATH']   # raises a KeyError if PYTHONPATH is not defined

In [None]:
os.environ.get('PYTHONPATH', '')
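Values in `os.environ` are always strings, so numeric settings have to be converted explicitly. A short sketch follows. `DEMO_TIMEOUT` and `DEMO_NOT_SET` are made-up variable names, and the example assumes that `DEMO_NOT_SET` is not defined in your environment.

```python
import os

# set a variable for this process (and its child processes) only
os.environ["DEMO_TIMEOUT"] = "30"

# the default is a string too, so the conversion works in both cases
timeout = int(os.environ.get("DEMO_TIMEOUT", "10"))
fallback = int(os.environ.get("DEMO_NOT_SET", "10"))
print(timeout, fallback)

# clean up again
del os.environ["DEMO_TIMEOUT"]
```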

### current working directory 

In [None]:
os.getcwd()

**all files in a directory**

In [None]:
os.listdir('.')

**create, rename and delete a file**

In [None]:
!touch _testfile

In [None]:
os.path.exists('_testfile')

In [None]:
os.rename('_testfile', 'testfile')

In [None]:
os.path.exists('_testfile')

In [None]:
os.remove('testfile')
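The `!touch` cells rely on a Unix shell. As a sketch of the same create, rename and delete sequence in pure Python, the following example works inside a temporary directory, so it leaves nothing behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    old_name = os.path.join(tmp, "_testfile")
    new_name = os.path.join(tmp, "testfile")

    # opening a file for writing creates it (empty) if it does not exist
    with open(old_name, "w"):
        pass
    created = os.path.exists(old_name)

    os.rename(old_name, new_name)
    old_gone = not os.path.exists(old_name)
    renamed = os.path.exists(new_name)

    os.remove(new_name)
    removed = not os.path.exists(new_name)

print(created, old_gone, renamed, removed)  # True True True True
```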

### setting file access permissions: `chmod`

In [None]:
!touch _test_file_permissions

In [None]:
os.stat('_test_file_permissions')

In [None]:
os.stat('_test_file_permissions').st_mode

get the octal representation of the file mode (file type and permission bits)

In [None]:
oct(os.stat('_test_file_permissions').st_mode)

keep only the permission bits by masking with `0o777`

In [None]:
oct(os.stat('_test_file_permissions').st_mode & 0o777)

change file permissions

In [None]:
os.chmod('_test_file_permissions', 0o666)
oct(os.stat('_test_file_permissions').st_mode & 0o777)

In [None]:
os.remove('_test_file_permissions')
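The standard `stat` module offers helpers for the bit arithmetic shown above. This sketch assumes a POSIX system (Linux or macOS). The file name `perm_demo` and the mode `0o640` are chosen for illustration only.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "perm_demo")
    with open(path, "w"):
        pass

    os.chmod(path, 0o640)
    mode = os.stat(path).st_mode

    perms = stat.S_IMODE(mode)    # permission bits only, without the file type
    text = stat.filemode(mode)    # the notation known from `ls -l`

print(oct(perms))  # 0o640
print(text)        # -rw-r-----
```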

### change file ownership: `chown`

In [None]:
!touch _test_file_ownership

In [None]:
os.stat('_test_file_ownership').st_uid

In [None]:
os.stat('_test_file_ownership').st_gid

In [None]:
os.getgroups()

In [None]:
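# 400 is an example group id: replace it with one of the ids returned by
# os.getgroups(), otherwise non-root users get a PermissionError
os.chown('_test_file_ownership', os.getuid(), 400)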

In [None]:
os.stat('_test_file_ownership').st_gid

In [None]:
os.remove('_test_file_ownership')
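A hard-coded group id only works on the system it was taken from. The sketch below (POSIX only) picks a group the current process certainly belongs to, namely its effective group, and passes `-1` as the user id so that the owner stays unchanged. The file name `ownership_demo` is a placeholder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ownership_demo")
    with open(path, "w"):
        pass

    # groups this process may assign to its own files
    my_groups = set(os.getgroups()) | {os.getegid()}
    target_gid = os.getegid()

    # -1 leaves the owner (uid) unchanged; only the group is set
    os.chown(path, -1, target_gid)
    new_gid = os.stat(path).st_gid

print("my groups:", sorted(my_groups))
print("file group:", new_gid)
```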

### working with directories

In [None]:
os.mkdir('tmp')

In [None]:
os.makedirs('tmp2/some/more/dirs')
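Running a `makedirs` cell a second time raises a `FileExistsError`, which happens easily in a notebook. A minimal sketch of the `exist_ok=True` parameter that avoids this, using a temporary directory instead of the workshop folders:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "some", "more", "dirs")
    os.makedirs(target)

    # a second call fails because the leaf directory already exists
    try:
        os.makedirs(target)
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

    # with exist_ok=True an existing directory is accepted silently
    os.makedirs(target, exist_ok=True)
    still_there = os.path.isdir(target)

print(second_call_failed, still_there)  # True True
```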

use `os.path.join` to safely join subfolders:

In [None]:
long_path = os.path.join('tmp3/','even/more', 'dirs')
print(long_path)

In [None]:
os.makedirs(long_path)
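Two details of `os.path.join` are worth knowing: it never doubles a separator, and an absolute component discards everything before it. The sketch uses `posixpath` directly, which is the module behind `os.path` on Linux and macOS, so that the results are the same on every platform.

```python
import posixpath  # os.path is this module on Linux and macOS

# a trailing slash in a component does not lead to a double slash
joined = posixpath.join("tmp3/", "even/more", "dirs")

# an absolute component throws away everything to its left
absolute = posixpath.join("tmp3", "/etc", "passwd")

# an empty last component yields a trailing separator
trailing = posixpath.join("tmp3", "")

print(joined)    # tmp3/even/more/dirs
print(absolute)  # /etc/passwd
print(trailing)  # tmp3/
```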

remove a single (empty) folder

In [None]:
os.rmdir('tmp')

**Remove empty nested folders**: `os.removedirs` removes the innermost folder and then works its way up, removing every parent folder that has become empty:

In [None]:
os.removedirs(long_path)

But: does it?

In [None]:
!touch tmp2/this_file_will_survive

In [None]:
os.removedirs('tmp2/some/more/dirs')

Not entirely. It removes `dirs`, `more` and `some`, but then **silently stops** at `tmp2`, because that folder still contains a file...

In [None]:
os.listdir('tmp2')
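The behaviour of `os.removedirs` can be reproduced in isolation. This sketch rebuilds the same folder layout inside a temporary directory and shows what is left afterwards:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    top = os.path.join(tmp, "tmp2")
    leaf = os.path.join(top, "some", "more", "dirs")
    os.makedirs(leaf)
    with open(os.path.join(top, "this_file_will_survive"), "w"):
        pass

    # removes dirs, more and some; stops without an error at tmp2,
    # because tmp2 is not empty
    os.removedirs(leaf)

    remaining = os.listdir(top)
    some_exists = os.path.exists(os.path.join(top, "some"))

print(remaining)    # ['this_file_will_survive']
print(some_exists)  # False
```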

**Conclusion: the `os` module is not always the best solution; look for alternatives**

In our case, the `shutil` module does it right:

In [None]:
import shutil
shutil.rmtree('tmp2', ignore_errors=True)

### recursively walk a tree

In [None]:
os.makedirs('walk/down/the/tree')

In [None]:
!touch walk/walk01
!touch walk/walk02
!touch walk/down/down01
!touch walk/down/down02
!touch walk/down/the/tree/tree01
!touch walk/down/the/tree/tree02

In [None]:
for dir_path, dir_names, file_names in os.walk('walk'):
    for filename in file_names:
        print(os.path.join(dir_path, filename))
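`os.walk` also allows skipping parts of the tree: removing an entry from `dir_names` in place stops the walk from descending into that folder. The following sketch builds a smaller copy of the tree in a temporary directory and prunes the folder `the`. Collecting the paths in the list `found` is an addition for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "walk")
    os.makedirs(os.path.join(root, "down", "the", "tree"))
    for rel in ("walk01", "down/down01", "down/the/tree/tree01"):
        with open(os.path.join(root, *rel.split("/")), "w"):
            pass

    found = []
    for dir_path, dir_names, file_names in os.walk(root):
        # modify dir_names in place to prune the walk
        if "the" in dir_names:
            dir_names.remove("the")
        for filename in file_names:
            full_path = os.path.join(dir_path, filename)
            # store the path relative to the root, with forward slashes
            found.append(os.path.relpath(full_path, root).replace(os.sep, "/"))

found.sort()
print(found)  # ['down/down01', 'walk01']
```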

This is doable, but a bit cumbersome, since we have to join the directory path `dir_path` with each file name again using the `os.path.join` function. Next, we are going to look at alternatives which might work better for you.