### The `pathlib` Module

The Python docs for this module are available [here](https://docs.python.org/3/library/pathlib.html#module-pathlib).

To work with paths in a way that behaves the same on both Windows and Unix-based systems, the `pathlib` module, part of the standard library, works nicely.

We'll first look at some path manipulation techniques, and finish off with creating/deleting directories, and the very important `glob` method to (optionally recursively) find all files matching some pattern.

The classes in this module can be divided into two types:
- pure paths
- concrete paths

The "pure paths" provide ways to manipulate paths, but do not actually access the file system. 

You have pure paths for both Windows (`PureWindowsPath`) and Posix (like Linux, Mac) (`PurePosixPath`), or a more generic `PurePath` that will decide whether to use Windows or Posix depending on the system it is running on. You can of course always instantiate a `PureWindowsPath` from a script running on a Posix platform to allow manipulating Windows paths, and vice-versa.
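As a quick sketch of that last point (the paths below are made up for illustration), both flavors of pure path can be created and manipulated on any platform, since they never touch the file system:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths never access the file system, so these hypothetical paths
# do not need to exist, and both flavors work on any OS.
win = PureWindowsPath('c:/projects/blog/post.txt')
posix = PurePosixPath('/home/user/blog/post.txt')

print(win.name, win.drive, win.parent)
print(posix.name, posix.parent)
```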

The "concrete paths" are used not only to manipulate paths, but also to interact with the file system itself.

Again, there are both Windows (`WindowsPath`) and Posix variants (`PosixPath`). And just as with pure paths, there is also a more generic `Path` class which will automatically base itself on the platform the script is running on (i.e. you'll end up either with a `WindowsPath` instance, or a `PosixPath` instance depending on your OS).

The concrete classes also inherit from their respective pure classes (so we have not only the pure path methods/properties available to us, but also the concrete ones).

It is this `Path` class we are going to explore here. It is most commonly used as it essentially provides us, the Python developer, a relatively platform-independent way of dealing with the OS file system.

First, let's import the `Path` class from the `pathlib` module:

In [1]:
from pathlib import Path

To specify paths, we can use fully qualified (or absolute) paths, as well as relative paths (where `.` denotes the current directory, while `..` is used to denote the parent directory).

We can get a `Path` object for the current working directory (the directory the script is running in) using the concrete `cwd()` class method:

In [2]:
curr = Path.cwd()

The first thing we can look at is the actual class of our path. Since I'm running this on a Mac, I'll get a posix path:

In [3]:
curr

PosixPath('/Users/fbaptiste/dev/python-blog/pending_release')

If you're running on Windows (not the WSL), you will end up with a `WindowsPath` object.

We could also use relative paths to get a Path object to our current path:

In [4]:
curr2 = Path('.')

What's interesting is that although both paths are essentially the same absolute path, they are not "equal" paths:

In [5]:
curr2

PosixPath('.')

In [6]:
curr == curr2

False

But, the absolute (fully qualified) paths are the same:

In [7]:
curr.absolute() == curr2.absolute()

True

We can also use the `samefile()` method to check if two paths are pointing to the same thing:

In [8]:
curr.samefile(curr2)

True
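Another option, not used above, is the `resolve()` method, which makes a path absolute and also collapses `..` components and follows symlinks. A minimal sketch (it runs against whatever your current directory happens to be):

```python
from pathlib import Path

curr = Path.cwd()
curr2 = Path('.')

# Different spellings of the same location compare as unequal...
print(curr == curr2)

# ...but resolving both yields the same normalized absolute path.
print(curr.resolve() == curr2.resolve())

# absolute() keeps a trailing '..' component as-is.
up = (curr / '..').absolute()
print(up.name)
```

If you need to compare two paths that may be spelled differently, comparing their resolved forms is usually a safer bet than comparing the paths directly.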

In both Windows and Posix, there is the concept of a "home" directory, which we can obtain using the `home()` method:

In [9]:
Path.home()

PosixPath('/Users/fbaptiste')

Now a path does not need to be a directory; it could be a file as well. (In fact, a path could point to other resources too, such as sockets, symlinks, block devices, etc., depending on the OS.)

Let's create a quick file and experiment with this:

In [10]:
with open('test.txt', 'w') as f:
    f.write("Testing the Python Path object.")

Now let's get a `Path` object for this file:

In [11]:
file = Path('test.txt')

In [12]:
file

PosixPath('test.txt')

In [13]:
file.absolute()

PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/test.txt')

We can use the `stat()` method to get some meta information about this file:

In [14]:
file_stat = file.stat()

In [15]:
file_stat

os.stat_result(st_mode=33188, st_ino=23966961, st_dev=16777232, st_nlink=1, st_uid=501, st_gid=0, st_size=31, st_atime=1662263924, st_mtime=1662263924, st_ctime=1662263924)

Some of this information will be platform dependent, but we can easily recover the file size (in bytes):

In [16]:
file_stat.st_size

31

as well as when the file was last modified:

In [17]:
file_stat.st_mtime

1662263924.4755654

This is a POSIX time (in seconds) - we can convert this to an actual datetime object:

In [18]:
from datetime import datetime

In [19]:
datetime.fromtimestamp(file_stat.st_mtime)

datetime.datetime(2022, 9, 3, 20, 58, 44, 475565)

Note that this is expressed in **local** time.
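If you would rather have a timezone-aware value in UTC, `fromtimestamp` accepts a `tz` argument. A small sketch, reusing the timestamp shown above as a literal so that it runs on its own:

```python
from datetime import datetime, timezone

# The modification time shown above, in seconds since the epoch.
mtime = 1662263924.4755654

# Passing tz=timezone.utc yields a timezone-aware datetime in UTC
# instead of a naive datetime in the machine's local time zone.
modified_utc = datetime.fromtimestamp(mtime, tz=timezone.utc)
print(modified_utc.isoformat())
```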

You can find more info on this stat structure [here](https://docs.python.org/3/library/os.html#os.stat_result).

You can also easily determine if a path exists or not:

In [20]:
Path('test.txt').exists()

True

In [21]:
Path('test2.txt').exists()

False

You probably know that path definitions in Windows and Posix are quite different. For example Windows uses the `\` as a path delimiter, whereas Posix uses `/`.

One difficulty is therefore how to define paths that will work on both Windows and Posix, without writing a bunch of conditional statements.

The nice thing about the classes in this module is that paths can always be specified using the Posix syntax, with `/` as the separator (which also saves us from having to escape `\` characters in our strings).

So, for both Windows and Posix, we can define a path this way:

In [22]:
Path('./dir1/dir2')

PosixPath('dir1/dir2')

Of course, in Windows you may need to use a drive letter, but you still use `/` for the path:

In [23]:
Path('c:/dir1/dir2')

PosixPath('c:/dir1/dir2')

From this point forward I'll be using Posix paths since I am running on a Mac, so you may need to adjust things for a Windows OS if that happens to be your platform. But things pretty much work the same way.

Continuing with our `exists()` example, we can also test the existence of a directory; we are not limited to just files:

In [24]:
Path('/Users/fbaptiste').exists()

True

In [25]:
Path('/Users/isaac-newton').exists()

False

So, given a path, you can not only test if it exists or not (as we just saw), but also determine if it is a file path or a directory path, using the `is_dir()` and `is_file()` methods.

Beware though, that both these methods will return `False` if the specified path does not exist!

In [26]:
Path('test.txt').is_dir(), Path('test.txt').is_file()

(False, True)

In [27]:
Path('/Users').is_dir(), Path('/Users').is_file()

(True, False)

On the other hand, the following two examples use non-existent paths, and all methods simply return `False`:

In [28]:
Path('test2.txt').is_dir(), Path('test2.txt').is_file()

(False, False)

In [29]:
Path('/abc').is_dir(), Path('/abc').is_file()

(False, False)

> Basically don't assume that just because `is_file` returns `False` then the path is therefore a directory. First of all, there are other kinds of path objects (e.g. sockets) that are neither, but it could also be that the specified path simply does not exist.
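One way to make this explicit is a small helper that checks each case in turn. The `describe` function below is a hypothetical name of my own, not part of `pathlib`, and the example works in a temporary directory so that nothing is left behind:

```python
import tempfile
from pathlib import Path


def describe(path: Path) -> str:
    """Classify a path without assuming 'not a file' means 'directory'."""
    if not path.exists():
        return 'missing'
    if path.is_file():
        return 'file'
    if path.is_dir():
        return 'directory'
    return 'other'  # e.g. a socket or a block device


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    sample = tmp_dir / 'test.txt'
    sample.write_text('Testing the Python Path object.')
    results = [
        describe(sample),
        describe(tmp_dir),
        describe(tmp_dir / 'test2.txt'),
    ]

print(results)
```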

Instead of building a path using a string as we have done so far, we can use the `joinpath()` method:

In [30]:
p = Path('Users').joinpath('dir1', 'dir2', 'dir3')
p

PosixPath('Users/dir1/dir2/dir3')

You can even use Path objects as arguments to `joinpath()`:

In [31]:
p = Path.home().joinpath('dir1', Path('dir2'), 'dir3')
p

PosixPath('/Users/fbaptiste/dir1/dir2/dir3')

Alternatively, the `/` operator can be used to join paths as well:

In [32]:
p = Path.home() / 'dir1' / Path('dir2') / 'test.txt'
p

PosixPath('/Users/fbaptiste/dir1/dir2/test.txt')
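One thing to watch for when joining: if any segment is itself an absolute path, everything before it is discarded. A short sketch, using `PurePosixPath` so that the result is the same on any OS:

```python
from pathlib import PurePosixPath

base = PurePosixPath('/Users/fbaptiste')

# joinpath() and the / operator build the same path.
same = base.joinpath('dir1', 'test.txt') == base / 'dir1' / 'test.txt'
print(same)

# An absolute segment discards everything that came before it.
reset = base / '/etc' / 'hosts'
print(reset)
```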

Going the other way, it is also easy to decompose a path into its individual parts:

In [33]:
p = Path.home() / 'dir1' / 'dir2'/ 'test.txt'

In [34]:
p.parts

('/', 'Users', 'fbaptiste', 'dir1', 'dir2', 'test.txt')

And for a Windows path:

In [35]:
from pathlib import PureWindowsPath

In [36]:
windows_p = PureWindowsPath('c:/') / 'dir1' / 'dir2'/ 'test.txt'
windows_p

PureWindowsPath('c:/dir1/dir2/test.txt')

By the way, the `str` representation of a Windows path will give you the "real" Windows path (notice how each `\` character shows up escaped, as `\\`, when the string is displayed):

In [37]:
str(windows_p)

'c:\\dir1\\dir2\\test.txt'

If you do not want to see the escape characters, you can just `print` the string:

In [38]:
print(str(windows_p))

c:\dir1\dir2\test.txt


In [39]:
windows_p.parts

('c:\\', 'dir1', 'dir2', 'test.txt')

Related to this you can get the logical ancestors of a path as a sequence:

In [40]:
p

PosixPath('/Users/fbaptiste/dir1/dir2/test.txt')

In [41]:
for idx, parent in enumerate(p.parents):
    print(f"{idx}: {parent}")

0: /Users/fbaptiste/dir1/dir2
1: /Users/fbaptiste/dir1
2: /Users/fbaptiste
3: /Users
4: /


And the same with a Windows path:

In [42]:
windows_p

PureWindowsPath('c:/dir1/dir2/test.txt')

In [43]:
for idx, parent in enumerate(windows_p.parents):
    print(f"{idx}: {parent}")

0: c:\dir1\dir2
1: c:\dir1
2: c:\


(If you're wondering why you're not seeing `\\`, this is because I used the `print` function, which displays the string itself rather than its escaped representation, so each backslash appears only once.)

Or, if you're only interested in the immediate parent, you can just use the `parent` property:

In [44]:
p, p.parent

(PosixPath('/Users/fbaptiste/dir1/dir2/test.txt'),
 PosixPath('/Users/fbaptiste/dir1/dir2'))

You can recover the last part of the path, often used for extracting a file name from a file path:

In [45]:
p

PosixPath('/Users/fbaptiste/dir1/dir2/test.txt')

In [46]:
p.name

'test.txt'

As well as the file extension of the final component (if it is specified):

In [47]:
p.suffix

'.txt'

For a directory path:

In [48]:
dir_p = Path.home() / 'dir1'
dir_p

PosixPath('/Users/fbaptiste/dir1')

In [49]:
dir_p.name

'dir1'

In [50]:
dir_p.suffix

''
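A few closely related properties and methods are worth knowing as well. The sketch below uses a made-up file name with two extensions to show how `suffix`, `suffixes`, `stem`, `with_suffix()` and `with_name()` behave:

```python
from pathlib import PurePosixPath

# Hypothetical file with a double extension.
archive = PurePosixPath('/Users/fbaptiste/dir1/backup.tar.gz')

print(archive.suffix)    # only the last extension
print(archive.suffixes)  # every extension, in order
print(archive.stem)      # the name minus the last extension

# Both methods return new path objects; the original is unchanged.
zipped = archive.with_suffix('.zip')
renamed = archive.with_name('notes.txt')
print(zipped)
print(renamed)
```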

There are many more methods to manipulate paths, so I urge you to look at the Python docs [here](https://docs.python.org/3/library/pathlib.html#module-pathlib).

Now let's turn our attention to some interesting concrete path methods.

The first thing is that you can open a file from a concrete `Path` using the `open` function, the same way you would with a string path:

In [51]:
with open('test.txt') as f:
    print(f.readlines())

['Testing the Python Path object.']


In [52]:
p = Path.cwd() / 'test.txt'
p

PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/test.txt')

In [53]:
with open(p) as f:
    print(f.readlines())

['Testing the Python Path object.']
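Concrete paths also have their own `open()` method, along with `read_text()` and `write_text()` shortcuts. A minimal sketch, written against a temporary directory rather than the `test.txt` file above:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / 'test.txt'

    # write_text() opens the file, writes the string and closes it again.
    p.write_text('Testing the Python Path object.')

    # Path.open() takes the same mode arguments as the built-in open().
    with p.open() as f:
        lines = f.readlines()

    # read_text() returns the whole file as a single string.
    contents = p.read_text()

print(lines)
print(contents)
```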


You can also create and delete directories:

In [54]:
p = Path.cwd() / 'dir1'
p.exists()

False

As you can see the directory path does not exist, but we can create it using `mkdir`:

In [55]:
p.mkdir()
p.exists()

True

Note that you will get an exception if you try to create a directory that already exists:

In [56]:
try:
    p.mkdir()
except FileExistsError as ex:
    print(ex)

[Errno 17] File exists: '/Users/fbaptiste/dev/python-blog/pending_release/dir1'


And we can also delete it (assuming it exists, otherwise we'll get an exception):

In [57]:
p.rmdir()
p.exists()

False

In [58]:
try:
    p.rmdir()
except FileNotFoundError as ex:
    print(ex)

[Errno 2] No such file or directory: '/Users/fbaptiste/dev/python-blog/pending_release/dir1'


Note that `rmdir` will only work on **empty** directories.

The `mkdir()` method also allows you to create a path, **and** all the necessary parents, but not by default:

In [59]:
p = Path.cwd() / 'dir1' / 'dir2' / 'dir3'
p

PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/dir1/dir2/dir3')

In [60]:
try:
    p.mkdir()
except FileNotFoundError as ex:
    print(ex)

[Errno 2] No such file or directory: '/Users/fbaptiste/dev/python-blog/pending_release/dir1/dir2/dir3'


This happens because `dir1` and `dir1/dir2` do not exist.

We can instruct Python to create the directories as needed by using the `parents=True` argument:

In [61]:
p.mkdir(parents=True)

And now all these directories exist:

In [62]:
(Path.cwd() / 'dir1').exists()

True

In [63]:
(Path.cwd() / 'dir1' / 'dir2').exists()

True

In [64]:
(Path.cwd() / 'dir1' / 'dir2' / 'dir3').exists()

True
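If you do not care whether the directory already exists, `mkdir()` also takes an `exist_ok=True` argument that suppresses the `FileExistsError` we saw earlier. A small sketch, using a temporary directory as the base:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / 'dir1' / 'dir2' / 'dir3'

    # Creates dir1, dir2 and dir3 in one call.
    p.mkdir(parents=True)

    # With exist_ok=True a second call is a no-op rather than an error.
    p.mkdir(parents=True, exist_ok=True)
    created = p.is_dir()

print(created)
```

Note that `exist_ok=True` only helps when the existing path is a directory; if a file is in the way you will still get an exception.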

We cannot simply delete these directories by deleting `dir1`, since `rmdir` only works on empty directories:

In [65]:
try:
    Path.cwd().joinpath('dir1').rmdir()
except OSError as ex:
    print(ex)

[Errno 66] Directory not empty: '/Users/fbaptiste/dev/python-blog/pending_release/dir1'


You will have to write code to delete the directories from the bottom up. I won't do that here, since a bug in the code could easily wipe out more than you expect - so use any kind of recursive deletion code with extreme care!
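For the specific case of a chain of **empty** directories, like the one we just created, a bottom-up loop is fairly low risk because `rmdir` refuses to touch anything that still has content. The sketch below is only an illustration, and deliberately works inside a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    p = base / 'dir1' / 'dir2' / 'dir3'
    p.mkdir(parents=True)

    # Walk upwards from the deepest directory, stopping at base.
    # rmdir() raises an exception on a non-empty directory, so nothing
    # that still holds files can be deleted by this loop.
    current = p
    while current != base:
        current.rmdir()
        current = current.parent

    remaining = list(base.iterdir())

print(remaining)
```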

We sometimes want to iterate over, or get a list of, all the paths contained in some directory in a concrete sense (that is, Python examines the file system to come up with that list).

For that we can use the `iterdir` generator method (you may need to change the path to suit your particular environment):

In [66]:
root = Path.cwd().joinpath('..')
root, root.exists()

(PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/..'), True)

In [67]:
for path in root.iterdir():
    print(path)

/Users/fbaptiste/dev/python-blog/pending_release/../2022
/Users/fbaptiste/dev/python-blog/pending_release/../video.ipynb
/Users/fbaptiste/dev/python-blog/pending_release/../.DS_Store
/Users/fbaptiste/dev/python-blog/pending_release/../LICENSE
/Users/fbaptiste/dev/python-blog/pending_release/../pyproject.toml
/Users/fbaptiste/dev/python-blog/pending_release/../todo
/Users/fbaptiste/dev/python-blog/pending_release/../README.md
/Users/fbaptiste/dev/python-blog/pending_release/../Pipfile
/Users/fbaptiste/dev/python-blog/pending_release/../pending_release
/Users/fbaptiste/dev/python-blog/pending_release/../.gitignore
/Users/fbaptiste/dev/python-blog/pending_release/../.ipynb_checkpoints
/Users/fbaptiste/dev/python-blog/pending_release/../.git
/Users/fbaptiste/dev/python-blog/pending_release/../Pipfile.lock
/Users/fbaptiste/dev/python-blog/pending_release/../.idea


I may be interested only in the sub directories, in which case we can easily keep just the directories, using a `filter` function (to keep the generator aspect), a list comprehension, or even just an `if` statement in a loop:

In [68]:
for item in root.iterdir():
    if item.is_dir():
        print(item)

/Users/fbaptiste/dev/python-blog/pending_release/../2022
/Users/fbaptiste/dev/python-blog/pending_release/../todo
/Users/fbaptiste/dev/python-blog/pending_release/../pending_release
/Users/fbaptiste/dev/python-blog/pending_release/../.ipynb_checkpoints
/Users/fbaptiste/dev/python-blog/pending_release/../.git
/Users/fbaptiste/dev/python-blog/pending_release/../.idea


In [69]:
[item for item in root.iterdir() if item.is_dir()]

[PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/../2022'),
 PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/../todo'),
 PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/../pending_release'),
 PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/../.ipynb_checkpoints'),
 PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/../.git'),
 PosixPath('/Users/fbaptiste/dev/python-blog/pending_release/../.idea')]

In [70]:
for item in filter(lambda item: item.is_dir(), root.iterdir()):
    print(item)

/Users/fbaptiste/dev/python-blog/pending_release/../2022
/Users/fbaptiste/dev/python-blog/pending_release/../todo
/Users/fbaptiste/dev/python-blog/pending_release/../pending_release
/Users/fbaptiste/dev/python-blog/pending_release/../.ipynb_checkpoints
/Users/fbaptiste/dev/python-blog/pending_release/../.git
/Users/fbaptiste/dev/python-blog/pending_release/../.idea


The `iterdir()` method is **not recursive**, so you would need to write your own recursive code to recursively list all paths starting at some root.
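To give an idea of what such code might look like, here is a minimal sketch. The `walk` generator is a hypothetical helper of my own, and the directory tree is created in a temporary directory just for the demonstration:

```python
import tempfile
from pathlib import Path
from typing import Iterator


def walk(root: Path) -> Iterator[Path]:
    """Yield every path below root, descending into sub directories."""
    for item in root.iterdir():
        yield item
        # Skip symlinked directories to avoid following cycles.
        if item.is_dir() and not item.is_symlink():
            yield from walk(item)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'dir1' / 'dir2').mkdir(parents=True)
    (base / 'dir1' / 'a.txt').write_text('a')
    (base / 'dir1' / 'dir2' / 'b.txt').write_text('b')
    found = sorted(p.relative_to(base).as_posix() for p in walk(base))

print(found)
```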

But for this specific scenario, Python gives us the `glob` method - an incredibly useful method to recursively (or not) list paths that match a specific pattern, starting at some root.

Let's say I want to see all the files in my current project that end with `.py`:

First, I'm going to set my root path (again you may need to tweak this based on your directory structure)

In [71]:
root = Path.cwd().parent
root.absolute()

PosixPath('/Users/fbaptiste/dev/python-blog')

Now, to list all `.py` paths:

In [72]:
for path in root.glob('*.py'):
    print(path)

As you can see there are no paths ending in `.py` in that directory - just to make sure this is actually working, let's look for something that I know does exist:

In [73]:
for path in root.glob('*.toml'):
    print(path)

/Users/fbaptiste/dev/python-blog/pyproject.toml


The `*` here basically says *match everything* - but as you can see, this is not recursive.

You can see standard path name pattern matching possibilities [here](https://docs.python.org/3/library/fnmatch.html#module-fnmatch)

The basics are:
- `*` matches any number of characters
- `?` matches a single character
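These two wildcards can be tried out directly with the `fnmatch` module, without touching the file system. The sketch uses `fnmatchcase` so that the results do not depend on the platform's case sensitivity; the file names are just examples:

```python
from fnmatch import fnmatchcase

# * matches any number of characters (including none).
star = fnmatchcase('pyproject.toml', '*.toml')

# ? matches exactly one character, so 'mod10.py' is one character too long.
one_char = fnmatchcase('mod1.py', 'mod?.py')
two_chars = fnmatchcase('mod10.py', 'mod?.py')

print(star, one_char, two_chars)
```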

In addition, `glob` also supports the pattern `**` which means **this directory and all subdirectories**.

For example, to find all the `.py` files recursively:

In [74]:
for path in root.glob('**/*.py'):
    print(path)

/Users/fbaptiste/dev/python-blog/2022/08 - August/click/setup.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/main.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/converters/csv_converter.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/converters/cli.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/viewers/enums.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/viewers/json_viewer.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/viewers/__init__.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/click/viewers/csv_viewer.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/black_isort/mod1.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/black_isort/mod2.py
/Users/fbaptiste/dev/python-blog/2022/08 - August/black_isort/badly_formatted.py
/Users/fbaptiste/dev/python-blog/2022/06 - June/async-producer-consumer/controller.py
/Users/fbaptiste/dev/python-blog/2022/06 - June/async-producer-consumer/resulthandler.py
/Users/fbap

Here's another example, where I want to find all the paths that contain `py` in their extension:

In [75]:
for path in root.glob('**/*.*py*'):
    print(path)

/Users/fbaptiste/dev/python-blog/video.ipynb
/Users/fbaptiste/dev/python-blog/.ipynb_checkpoints
/Users/fbaptiste/dev/python-blog/2022/09 - September/defaultdict.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/pydantic.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/pairwise-iteration-using-zip.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/type-hinting.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/case-insensitive-string-comparisons.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/.ipynb_checkpoints
/Users/fbaptiste/dev/python-blog/2022/07 - July/unicode.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/comparing-lists.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/concatenating-sequences.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/.ipynb_checkpoints/case-insensitive-string-comparisons-checkpoint.ipynb
/Users/fbaptiste/dev/python-blog/2022/07 - July/.ipynb_checkpoints/unicode-checkpoint.ipynb
/Users/fbaptiste/dev/python-blog/2022/07

You can see this includes all different kinds of extensions:

In [76]:
extensions = {p.suffix for p in root.glob('**/*.*py*') if p.suffix}
extensions

{'.ipynb', '.py', '.pyc'}

Note that pattern matching with glob is not like regex - it is mainly intended to be used with "inclusion" patterns (e.g. what matches should be included, and to a far lesser extent what not to include). For more advanced filtering you should glob all the things you want to include, and then use secondary logic (possibly even regex) to filter out what you really want.
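As a sketch of that two-step approach, the example below globs for every `.py` file and then uses a regex to drop "dunder" files such as `__init__.py`. The exclusion rule and the file names are assumptions made up for the illustration, and the files live in a temporary directory:

```python
import re
import tempfile
from pathlib import Path

# Hypothetical rule: keep .py files, but not dunder files like __init__.py.
dunder = re.compile(r'^__.*__\.py$')

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'viewers').mkdir()
    names = ('main.py', 'viewers/__init__.py', 'viewers/enums.py', 'README.md')
    for name in names:
        (base / name).write_text('')

    # Step 1: glob broadly for everything that should be included.
    candidates = base.glob('**/*.py')
    # Step 2: apply the exclusion logic in plain Python.
    kept = sorted(p.name for p in candidates if not dunder.match(p.name))

print(kept)
```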