Go through the basics of creating a Python script, and then create a Python file so the script can be run from the terminal. In this practice notebook, you'll create the building blocks for a script that finds large files on the filesystem.

## Get the logic right 
Start by defining some of the requirements of the script. In this case, we need to:
- _Walk_ the filesystem looking at files, directories and sub-directories
- Capture file information: is it a file? a directory? what size? what path?
- Store that information in a suitable data structure
- Report the largest files by sorting the data held in that structure
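
As a rough sketch of the second requirement, the standard library can answer each of those questions for a single path. The `describe` helper and the temporary directory below are illustrative assumptions, not part of the script this notebook builds.

```python
import os
import tempfile


def describe(path):
    """Return basic facts about a path (hypothetical helper for illustration)."""
    return {
        "path": path,
        "is_file": os.path.isfile(path),
        "is_dir": os.path.isdir(path),
        "size": os.path.getsize(path),  # size in bytes
    }


# Build a throwaway directory with one 5-byte file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"hello")
    file_info = describe(sample)
    dir_info = describe(tmp)

print(file_info["is_file"], file_info["size"])
print(dir_info["is_dir"])
```

A plain dictionary is used here only because it mirrors the dictionaries used in the cells that follow.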

In [7]:
import os
# os.walk() yields one tuple per directory it visits: the directory path, its sub-directory (folder) names and its file names
# The first tuple describes the root path, so the loop stops after printing it
for dirpath, dirnames, filenames in os.walk('.'):
    print(type(dirpath), type(dirnames), type(filenames))
    print(dirpath, dirnames, filenames)
    break

<class 'str'> <class 'list'> <class 'list'>
. ['scripts', '.git', '.ipynb_checkpoints'] ['large_files.py', '.gitignore', 'querying-databases.ipynb', 'README.md', 'sqlite-operations.ipynb', 'large-files.ipynb']


In [16]:
# Capture every path in a dictionary along with its corresponding directories (folders) and files
paths = {}
for dirpath, dirnames, filenames in os.walk('.'):
    key = dirpath
    if key not in paths:
        paths[key] = {}
        paths[key]['directories'] = dirnames
        paths[key]['files'] = filenames

print(paths.keys())
print(len(paths['./.git']['files']))

dict_keys(['.', './scripts', './.git', './.git/refs', './.git/refs/heads', './.git/refs/remotes', './.git/refs/remotes/origin', './.git/hooks', './.git/info', './.git/objects', './.git/objects/88', './.git/objects/dd', './.git/objects/ec', './.git/objects/d4', './.git/objects/b6', './.git/objects/17', './.git/objects/c1', './.git/objects/c9', './.git/objects/66', './.git/objects/e7', './.git/objects/28', './.git/objects/a9', './.git/objects/82', './.git/objects/33', './.git/objects/de', './.git/objects/c6', './.git/objects/a8', './.git/objects/18', './.git/objects/da', './.git/objects/c8', './.git/objects/fe', './.git/objects/74', './.git/objects/af', './.git/objects/70', './.git/objects/6a', './.git/logs', './.git/logs/refs', './.git/logs/refs/heads', './.git/logs/refs/remotes', './.git/logs/refs/remotes/origin', './.ipynb_checkpoints'])
7


In [23]:
# The os module is well suited to filesystem operations like "walking" through directories and files
# There are many ways of achieving the same effect, but `os.walk()` is a good way to loop over the filesystem
import os
for root, directories, files in os.walk('.'):
    # Report once per directory, and only for directories that contain at least one file
    if files:
        print(f"Total files {len(files)} - {files} from current root : {root}")
        print(f"Total directories {len(directories)} - {directories} at current root : {root}")

Total files 6 - ['large_files.py', '.gitignore', 'querying-databases.ipynb', 'README.md', 'sqlite-operations.ipynb', 'large-files.ipynb'] from current root : .
Total directories 3 - ['scripts', '.git', '.ipynb_checkpoints'] at current root : .
Total files 1 - ['generate_large_files.py'] from current root : ./scripts
Total directories 0 - [] at current root : ./scripts
Total files 7 - ['index', 'description', 'ORIG_HEAD', 'HEAD', 'config', 'packed-refs', 'FETCH_HEAD'] from current root : ./.git
Total directories 5 - ['refs', 'hooks', 'info', 'objects', 'logs'] at current root : ./.git
Total files 1 - ['main'] from current root : ./.git/refs/heads
Total directories 0 - [] at current root : ./.git/refs/heads
Total files 2 - ['HEAD', 'main'] from current root : ./.git/refs/remotes/origin
Total directories 0 - [] at current root : ./.git/refs/remotes/origin
Total files 11 - ['pre-push.sample', 'pre-applypatch.sample', 'applypatch-msg.sample', 'post-update.sample', 'pre-rebase.sample', 'fsmo

In [24]:
# Update the loop so that it shows the full path of each file (relative to the starting directory), ignoring directories, which we aren't going to track
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        print(f"File found: {full_path}")


File found: ./large_files.py
File found: ./.gitignore
File found: ./querying-databases.ipynb
File found: ./README.md
File found: ./sqlite-operations.ipynb
File found: ./large-files.ipynb
File found: ./scripts/generate_large_files.py
File found: ./.git/index
File found: ./.git/description
File found: ./.git/ORIG_HEAD
File found: ./.git/HEAD
File found: ./.git/config
File found: ./.git/packed-refs
File found: ./.git/FETCH_HEAD
File found: ./.git/refs/heads/main
File found: ./.git/refs/remotes/origin/HEAD
File found: ./.git/refs/remotes/origin/main
File found: ./.git/hooks/pre-push.sample
File found: ./.git/hooks/pre-applypatch.sample
File found: ./.git/hooks/applypatch-msg.sample
File found: ./.git/hooks/post-update.sample
File found: ./.git/hooks/pre-rebase.sample
File found: ./.git/hooks/fsmonitor-watchman.sample
File found: ./.git/hooks/prepare-commit-msg.sample
File found: ./.git/hooks/commit-msg.sample
File found: ./.git/hooks/pre-commit.sample
File found: ./.git/hooks/update.sample

So now we have a few objectives completed:
- Files are detected
- Full paths are being collected
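
The listing above also descends into `.git`. If those files are not of interest, one possible approach is to prune directories during the walk. The sketch below assumes a hypothetical `SKIP_DIRS` set and `walk_files` helper, and builds a small temporary tree so that it runs on its own.

```python
import os
import tempfile

SKIP_DIRS = {".git", ".ipynb_checkpoints"}  # assumed names to ignore


def walk_files(top, skip_dirs=SKIP_DIRS):
    """Yield file paths under top without descending into skip_dirs."""
    for root, directories, files in os.walk(top):
        # Slice assignment edits the very list os.walk() uses to decide where
        # to go next, so pruned directories are never visited. This relies on
        # topdown=True, which is the default.
        directories[:] = [d for d in directories if d not in skip_dirs]
        for _file in files:
            yield os.path.join(root, _file)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, ".git"))
    os.makedirs(os.path.join(tmp, "scripts"))
    for relative in ("README.md", ".git/config", "scripts/run.py"):
        with open(os.path.join(tmp, *relative.split("/")), "w") as f:
            f.write("x")
    found = sorted(os.path.relpath(p, tmp) for p in walk_files(tmp))

print(found)
```

Rebinding the name (`directories = [...]`) would not work here, because `os.walk()` would keep using its original list.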

Next, we need to find size information. `os.path.getsize()` reports sizes in bytes, so in addition to capturing the size, we'll need a way to convert bytes to megabytes or gigabytes to make the output easier to read.
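
One way to do that conversion is sketched below. The `human_readable` name and the choice of binary units (1 KiB = 1024 bytes) are assumptions made for illustration; decimal units (1 MB = 1,000,000 bytes) would work just as well with a divisor of 1000.

```python
def human_readable(size_in_bytes):
    """Format a byte count using binary units (hypothetical helper)."""
    size = float(size_in_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit in which the value drops below 1024
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


print(human_readable(61))       # 61.0 B
print(human_readable(275811))   # 269.3 KiB
print(human_readable(1024 ** 3))  # 1.0 GiB
```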

In [25]:
# Update the loop to include the file size
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        print(f"Size: {size}b - File: {full_path}")

Size: 275811b - File: ./large_files.py
Size: 1799b - File: ./.gitignore
Size: 16667b - File: ./querying-databases.ipynb
Size: 61b - File: ./README.md
Size: 4447b - File: ./sqlite-operations.ipynb
Size: 20399b - File: ./large-files.ipynb
Size: 639b - File: ./scripts/generate_large_files.py
Size: 681b - File: ./.git/index
Size: 73b - File: ./.git/description
Size: 41b - File: ./.git/ORIG_HEAD
Size: 21b - File: ./.git/HEAD
Size: 268b - File: ./.git/config
Size: 112b - File: ./.git/packed-refs
Size: 107b - File: ./.git/FETCH_HEAD
Size: 41b - File: ./.git/refs/heads/main
Size: 30b - File: ./.git/refs/remotes/origin/HEAD
Size: 41b - File: ./.git/refs/remotes/origin/main
Size: 1348b - File: ./.git/hooks/pre-push.sample
Size: 424b - File: ./.git/hooks/pre-applypatch.sample
Size: 478b - File: ./.git/hooks/applypatch-msg.sample
Size: 189b - File: ./.git/hooks/post-update.sample
Size: 4898b - File: ./.git/hooks/pre-rebase.sample
Size: 3327b - File: ./.git/hooks/fsmonitor-watchman.sample
Size: 149

In [26]:
# Persist the data in a dictionary. Since file paths are unique, you can use them as dictionary keys
file_metadata = {}
for root, directories, files in os.walk('.'):
    for _file in files:
        full_path = os.path.join(root, _file)
        size = os.path.getsize(full_path)
        file_metadata[full_path] = size
print(file_metadata)

{'./large_files.py': 275811, './.gitignore': 1799, './querying-databases.ipynb': 16667, './README.md': 61, './sqlite-operations.ipynb': 4447, './large-files.ipynb': 23846, './scripts/generate_large_files.py': 639, './.git/index': 681, './.git/description': 73, './.git/ORIG_HEAD': 41, './.git/HEAD': 21, './.git/config': 268, './.git/packed-refs': 112, './.git/FETCH_HEAD': 107, './.git/refs/heads/main': 41, './.git/refs/remotes/origin/HEAD': 30, './.git/refs/remotes/origin/main': 41, './.git/hooks/pre-push.sample': 1348, './.git/hooks/pre-applypatch.sample': 424, './.git/hooks/applypatch-msg.sample': 478, './.git/hooks/post-update.sample': 189, './.git/hooks/pre-rebase.sample': 4898, './.git/hooks/fsmonitor-watchman.sample': 3327, './.git/hooks/prepare-commit-msg.sample': 1492, './.git/hooks/commit-msg.sample': 896, './.git/hooks/pre-commit.sample': 1642, './.git/hooks/update.sample': 3610, './.git/hooks/pre-receive.sample': 544, './.git/info/exclude': 240, './.git/objects/88/62a800dae9c
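
On a larger filesystem, `os.path.getsize()` can raise `OSError`, for example on a broken symbolic link or a file removed while the walk is running. The following is a minimal sketch of a more defensive version of the loop; `collect_sizes` is a hypothetical name, and the temporary directory exists only to make the example runnable.

```python
import os
import tempfile


def collect_sizes(top):
    """Map each file path under top to its size in bytes, skipping unreadable entries."""
    sizes = {}
    for root, directories, files in os.walk(top):
        for _file in files:
            full_path = os.path.join(root, _file)
            try:
                sizes[full_path] = os.path.getsize(full_path)
            except OSError:
                # Broken symlink, file deleted mid-walk, or permission problem
                continue
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.bin"), "wb") as f:
        f.write(b"abc")
    try:
        # A link whose target does not exist, to exercise the except branch
        os.symlink(os.path.join(tmp, "missing"), os.path.join(tmp, "dangling"))
    except (OSError, NotImplementedError):
        pass  # symlinks may be unavailable on some platforms
    sizes = collect_sizes(tmp)

print(sorted(sizes.values()))
```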

**Exercise:** Now that the metadata is captured and stored in a suitable data structure like a dictionary, report back the results with only the four largest files. Try using other quantities to report on, like the 10 largest files instead of 4.

In [38]:
# Report the largest files in the file_metadata dictionary built above
def largestXFiles(x):
    # Sort the (path, size) pairs by size, largest first, and keep the first x.
    # Sorting file_metadata itself would iterate over the paths only, never the sizes.
    largestFiles = sorted(file_metadata.items(), key=lambda item: item[1], reverse=True)[:x]
    # Rebuilding a dictionary from the sorted pairs keeps them in size order
    print(dict(largestFiles))
    print(f"The largest {x} files within the file_metadata object are : {[path for path, _size in largestFiles]}")

In [39]:
largestXFiles(5)

In [41]:
largestXFiles(8)
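
As an alternative sketch for the exercise, `heapq.nlargest` from the standard library picks the top entries without sorting the whole dictionary. The `largest_files` function and `sample_metadata` are illustrative names; the sample sizes are copied from the `file_metadata` output shown earlier.

```python
import heapq

# Illustrative sample in the same shape as file_metadata
sample_metadata = {
    "./large_files.py": 275811,
    "./.gitignore": 1799,
    "./querying-databases.ipynb": 16667,
    "./README.md": 61,
    "./sqlite-operations.ipynb": 4447,
}


def largest_files(metadata, count):
    """Return the count largest (path, size) pairs, largest first."""
    return heapq.nlargest(count, metadata.items(), key=lambda item: item[1])


top_two = largest_files(sample_metadata, 2)
for path, size in top_two:
    print(f"{size:>8} bytes  {path}")
```

Passing the dictionary in as an argument, rather than reading a global, makes the function easier to reuse once the logic moves into a script.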
