In [2]:
import os

os.chdir('/content/drive/MyDrive/Colab Notebooks/Pathlib-python')
os.listdir()

['google-download-images', 'pathlib-notebook.ipynb']

For example, when I start small projects, I create `in` and `out` directories as subdirectories of the current working directory (obtained with os.getcwd()). I use those directories to store the working input and output files. Here’s what that code would look like:

In [4]:
import os

in_dir = os.path.join(os.getcwd(), "in")
out_dir = os.path.join(os.getcwd(), "out")
in_file = os.path.join(in_dir, "input.xlsx")
out_file = os.path.join(out_dir, "output.xlsx")

display(in_dir, out_dir, in_file, out_file)

'/content/drive/MyDrive/Colab Notebooks/Pathlib-python/in'

'/content/drive/MyDrive/Colab Notebooks/Pathlib-python/out'

'/content/drive/MyDrive/Colab Notebooks/Pathlib-python/in/input.xlsx'

'/content/drive/MyDrive/Colab Notebooks/Pathlib-python/out/output.xlsx'

Let’s see what it looks like if we use the pathlib module.

In [5]:
from pathlib import Path

in_file_1 = Path.cwd() / "in" / "input.xlsx"
out_file_1 = Path.cwd() / "out" / "output.xlsx"

display(in_file_1, out_file_1)

PosixPath('/content/drive/MyDrive/Colab Notebooks/Pathlib-python/in/input.xlsx')

PosixPath('/content/drive/MyDrive/Colab Notebooks/Pathlib-python/out/output.xlsx')
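Building a path does not touch the disk, so the `in` and `out` directories still have to be created before files can be written to them. A minimal sketch of one way to do that, assuming a temporary directory stands in for the working directory (the `base` name is illustrative only):

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the current working directory (assumption).
base = Path(tempfile.mkdtemp())

in_dir = base / "in"
out_dir = base / "out"

# parents=True creates any missing parent folders;
# exist_ok=True makes the call safe to run more than once.
for directory in (in_dir, out_dir):
    directory.mkdir(parents=True, exist_ok=True)

print(in_dir.is_dir(), out_dir.is_dir())
```

In a real project you would likely replace `base` with `Path.cwd()`.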

Additionally, if you don’t like the syntax above, you can chain multiple parts together using joinpath:

In [6]:
in_file_3 = Path.cwd().joinpath("in").joinpath("input.xlsx")
out_file_3 = Path.cwd().joinpath("out").joinpath("output.xlsx")

display(in_file_3, out_file_3)

PosixPath('/content/drive/MyDrive/Colab Notebooks/Pathlib-python/in/input.xlsx')

PosixPath('/content/drive/MyDrive/Colab Notebooks/Pathlib-python/out/output.xlsx')

Finally, there is one other trick you can use to build up a path with multiple directories:

In [7]:
parts = ["in", "input.xlsx"]
in_file_4 = Path.cwd().joinpath(*parts)

display(in_file_4)

PosixPath('/content/drive/MyDrive/Colab Notebooks/Pathlib-python/in/input.xlsx')

The added benefit of these methods is that you are creating a Path object rather than just a string representation of the path. Look at the difference between the type of in_file and the type of in_file_1:

In [9]:
print(type(in_file))
print(type(in_file_1))

<class 'str'>
<class 'pathlib.PosixPath'>


The fact that the path is an object means we can do a lot of useful actions on it. It’s also interesting that the path object “knows” it is on a Linux (POSIX) system: Path() returns a PosixPath there and a WindowsPath on Windows, without the programmer having to tell it.
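As a rough illustration of those "useful actions", here is a sketch of a few attributes every path object exposes. It uses PurePosixPath with a hypothetical path so that it behaves the same on any operating system and never touches the filesystem:

```python
from pathlib import PurePosixPath

# Hypothetical path; PurePosixPath only manipulates the text of the path.
in_file = PurePosixPath("/project/in/input.xlsx")

print(in_file.name)                 # input.xlsx
print(in_file.stem)                 # input
print(in_file.suffix)               # .xlsx
print(in_file.parent)               # /project/in
print(in_file.with_suffix(".csv"))  # /project/in/input.csv
```

With plain strings, each of these would need a separate os.path call or manual string slicing.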

You can combine glob with pathlib to remove all the .svg files from a directory:

In [12]:
import glob
import pathlib

for file in glob.glob('google-download-images/*.svg'):
    path = pathlib.Path(file)
    path.unlink()  # delete the file
    # display(path)
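The same cleanup can be written with pathlib alone, since Path.glob yields Path objects directly. A minimal sketch, assuming a temporary directory with made-up file names rather than the real image folder:

```python
import tempfile
from pathlib import Path

# Temporary directory with sample files (hypothetical names).
image_dir = Path(tempfile.mkdtemp())
for name in ("a.svg", "b.svg", "keep.jpg"):
    (image_dir / name).touch()

# Path.glob returns Path objects, so no conversion from str is needed.
for svg_file in image_dir.glob("*.svg"):
    svg_file.unlink()

remaining = sorted(f.name for f in image_dir.iterdir())
print(remaining)
```

Only the .svg files are removed; everything else in the folder is left alone.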

To get the examples started, create the Path to the google-download-images directory:

In [14]:
from pathlib import Path

dir_to_scan = "/content/drive/MyDrive/Colab Notebooks/Pathlib-python/google-download-images"
p = Path(dir_to_scan)

In [15]:
p.is_dir()

True

In [16]:
p.is_file()

False

In [17]:
p.parts

('/',
 'content',
 'drive',
 'MyDrive',
 'Colab Notebooks',
 'Pathlib-python',
 'google-download-images')

In [18]:
p.as_uri()

'file:///content/drive/MyDrive/Colab%20Notebooks/Pathlib-python/google-download-images'

In [19]:
p.parent

PosixPath('/content/drive/MyDrive/Colab Notebooks/Pathlib-python')

Walking Directories
The first approach I will cover is to use the `os.scandir` function to go through all the files and directories in a given path and build a list of all the directories and a list of all the files.

In [20]:
folders = []
files = []

for entry in os.scandir(p):
    if entry.is_dir():
        folders.append(entry)
    elif entry.is_file():
        files.append(entry)

print("Folders - {}".format(folders))
print("Files - {}".format(files))

Folders - []
Files - [<DirEntry 'image2.jpeg'>, <DirEntry 'images3.jpg'>, <DirEntry 'image20.jpeg'>, <DirEntry 'images13.jpg'>, <DirEntry 'image32.jpeg'>, <DirEntry 'image4.jpeg'>, <DirEntry 'images9.jpg'>, <DirEntry 'siberian_husky_cute_puppies.jpg'>, <DirEntry 'images5.jpg'>, <DirEntry 'shelter-dog-cropped-1-632x329.jpg'>, <DirEntry 'labrador-puppy.jpg'>, <DirEntry 'image30.jpeg'>, <DirEntry 'image24.jpeg'>, <DirEntry 'images18.jpg'>, <DirEntry 'image6.jpeg'>, <DirEntry 'image14.jpeg'>, <DirEntry 'images10.jpg'>, <DirEntry 'images15.jpg'>, <DirEntry 'image31.jpeg'>, <DirEntry 'images12.jpg'>, <DirEntry 'WhatsAppImage2020-11-06at10.31.02_1098x1600.jpg'>, <DirEntry 'images17.jpg'>, <DirEntry 'images7.jpg'>, <DirEntry 'image7.jpeg'>, <DirEntry 'image17.jpeg'>, <DirEntry 'image22.jpeg'>, <DirEntry 'image15.jpeg'>, <DirEntry 'image29.jpeg'>, <DirEntry 'image25.jpeg'>, <DirEntry 'image27.jpeg'>, <DirEntry 'image21.jpeg'>, <DirEntry 'images4.jpg'>, <DirEntry 'GettyImages-1290033865.jpg'>, <
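For comparison, pathlib offers iterdir, which yields Path objects instead of DirEntry objects. A small sketch of the same folder/file split, assuming a temporary directory with one hypothetical subfolder and one file:

```python
import tempfile
from pathlib import Path

# Temporary directory with one subfolder and one file (hypothetical names).
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "photo.jpg").touch()

# iterdir lists the immediate children only; it does not recurse.
folders = sorted(entry.name for entry in root.iterdir() if entry.is_dir())
files = sorted(entry.name for entry in root.iterdir() if entry.is_file())

print("Folders - {}".format(folders))
print("Files - {}".format(files))
```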

If you need to go through all the subdirectories as well, then you should use `os.walk`. Here is an example that shows all the directories and files within the google-download-images folder.

In [None]:
for dirName, subdirList, fileList in os.walk(p):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
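To make the shape of what os.walk yields more concrete, here is a sketch that tallies the number of files in each directory of a small temporary tree. The folder and file names are invented for the illustration:

```python
import os
import tempfile
from pathlib import Path

# Build a small nested tree (hypothetical names).
root = Path(tempfile.mkdtemp())
(root / "a" / "b").mkdir(parents=True)
(root / "top.txt").touch()
(root / "a" / "one.txt").touch()
(root / "a" / "b" / "two.txt").touch()
(root / "a" / "b" / "three.txt").touch()

# os.walk yields one (directory, subdirectories, files) tuple per directory.
counts = {}
for dir_name, subdir_list, file_list in os.walk(root):
    relative = os.path.relpath(dir_name, root)
    counts[relative] = len(file_list)

print(counts)
```

Each directory appears exactly once, with the top-level directory reported as ".".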

Count the number of files

In [23]:
import os

dir_path = r'/content/drive/MyDrive/Colab Notebooks/Pathlib-python/google-download-images'
# os.listdir returns every entry, so subdirectories are counted too
print(len(os.listdir(dir_path)))

291
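Because os.listdir counts folders as well as files, the number can overstate the file count when subdirectories exist. A minimal sketch of counting files only, assuming a temporary directory with hypothetical contents:

```python
import os
import tempfile
from pathlib import Path

# Temporary directory with two files and one subfolder (hypothetical names).
directory = Path(tempfile.mkdtemp())
(directory / "nested").mkdir()
for name in ("one.jpg", "two.jpg"):
    (directory / name).touch()

entry_count = len(os.listdir(directory))  # files and folders together
file_count = sum(1 for entry in directory.iterdir() if entry.is_file())

print(entry_count, file_count)
```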


With pathlib, the first approach is to use glob to list the files in a directory. Note that the pattern '*.*' only matches names that contain a dot:

In [None]:
count = 0
for i in p.glob('*.*'):
    print(i.name)
    count += 1
display(count)

Another option is `rglob`, which automatically recurses through the subdirectories. Here is a shortcut to build a list of all of the svg files:

In [29]:
list(p.rglob('*.svg'))

[]

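Glob patterns also support character classes, but they work on single characters, not whole extensions. `[!jpg]` matches one character that is not j, p, or g, so the pattern below only matches files with a one-character extension. It does not return everything except .jpg files, and here it finds nothing: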

In [28]:
list(p.rglob('*.[!jpg]'))

[]
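To exclude an extension outright, one option is to filter on each path's `suffix` attribute instead of relying on the pattern. A minimal sketch, assuming a temporary directory with made-up file names, that also shows what the `[!jpg]` character class actually selects:

```python
import tempfile
from pathlib import Path

# Temporary directory with sample files (hypothetical names).
root = Path(tempfile.mkdtemp())
for name in ("a.jpg", "b.jpeg", "c.png", "d.x"):
    (root / name).touch()

# The character class matches exactly one character that is not j, p or g.
single_char = sorted(f.name for f in root.rglob("*.[!jpg]"))

# Filtering on suffix excludes .jpg files regardless of extension length.
non_jpg = sorted(
    f.name for f in root.rglob("*")
    if f.is_file() and f.suffix.lower() != ".jpg"
)

print(single_char)
print(non_jpg)
```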

For this example, I will go through all the files in the google-download-images directory and build a DataFrame with the file name, parent path and ctime timestamp. This approach is easily extensible to any other information you might want to include.

Here’s the standalone example:

In [30]:
import pandas as pd
from pathlib import Path
import time

all_files = []
for i in p.rglob('*.*'):
    all_files.append((i.name, i.parent, time.ctime(i.stat().st_ctime)))

columns = ["File_Name", "Parent", "Created"]
df = pd.DataFrame.from_records(all_files, columns=columns)

df.head()

      File_Name                                             Parent                   Created
0   image2.jpeg  /content/drive/MyDrive/Colab Notebooks/Pathlib...  Mon Feb 27 03:16:13 2023
1   images3.jpg  /content/drive/MyDrive/Colab Notebooks/Pathlib...  Mon Feb 27 03:15:10 2023
2  image20.jpeg  /content/drive/MyDrive/Colab Notebooks/Pathlib...  Mon Feb 27 03:16:17 2023
3  images13.jpg  /content/drive/MyDrive/Colab Notebooks/Pathlib...  Mon Feb 27 03:16:13 2023
4  image32.jpeg  /content/drive/MyDrive/Colab Notebooks/Pathlib...  Mon Feb 27 03:14:37 2023
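As one way of extending the records with more information, the sketch below adds the file size and a timezone-aware modification time. It sticks to the standard library and a temporary directory with hypothetical files; the `Size_Bytes` and `Modified` column names are my own choice, not part of the original example:

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Small temporary tree (hypothetical names and contents).
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.txt").write_text("hello")
(root / "sub" / "b.txt").write_text("hi")

records = []
for path in sorted(root.rglob("*.*")):
    info = path.stat()  # one stat call provides size and timestamps
    records.append({
        "File_Name": path.name,
        "Parent": str(path.parent),
        "Size_Bytes": info.st_size,
        "Modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    })

for record in records:
    print(record["File_Name"], record["Size_Bytes"])
```

A list of dictionaries like this should be accepted by `pd.DataFrame.from_records(records)` if you want the result as a DataFrame.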


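A note on `stat().st_ctime` in `pathlib`: this is the “ctime” as reported by the operating system, the same field that the `stat.ST_CTIME` constant refers to. On some systems (like Unix) it is the time of the last metadata change, and on others (like Windows) it is the creation time (see platform documentation for details). The Created column above is therefore a true creation time only on the latter.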

To convert a timestamp such as st_mtime into a datetime object, use `datetime.datetime.fromtimestamp`, for example:

In [31]:
from datetime import datetime, timezone

stat_result = p.stat()
modified = datetime.fromtimestamp(stat_result.st_mtime, tz=timezone.utc)
print('modified', modified)

modified 2023-02-27 03:33:19+00:00


If you want a readable string, format the result with strftime:

In [34]:
import datetime
mtime = p.stat().st_mtime
timestamp_str = datetime.datetime.fromtimestamp(mtime).strftime('%Y-%m-%d %H:%M')

print(timestamp_str)

2023-02-27 03:33
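Without a `tz` argument, fromtimestamp uses the machine's local timezone, so the same file can print differently on another computer. A sketch of a reproducible variant using UTC and ISO 8601 output; the file and the timestamp are invented, and os.utime is used only to give the file a known modification time:

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical sample file in a temporary directory.
sample = Path(tempfile.mkdtemp()) / "sample.txt"
sample.touch()

# Set access and modification times to a known instant: 2023-01-01 12:00:00 UTC.
known = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc).timestamp()
os.utime(sample, (known, known))

# Passing tz=timezone.utc makes the result independent of the local timezone.
modified = datetime.fromtimestamp(sample.stat().st_mtime, tz=timezone.utc)
print(modified.isoformat())
```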
