# 01 - interacting with the file system
welcome to the first workshop, where we practise using the `os` module.

# problem statement
we'll start by describing the problem we want to solve. if you already know all you need to solve it, you can skip the rest of this session! 

imagine you found a nice dataset you need to analyse. instead of being a single file, or a set of conveniently named files sitting together in one directory, the files are scattered across folders, sub-folders, and sub-sub-folders. there are tens or hundreds of them. also, not all of the files are data files; some of them are documentation files.

you just want a list of the data files so you can iterate over it and process them all in some way. so you need to write a python function that takes the path to a root directory as its argument, traverses the folder tree, collects all the files in it that have a given filename extension, and returns a list of the files found (path + filename).
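
to have something to practise on, a throwaway tree like the one described above can be generated. this is only a sketch: the `.dat` names and byte sizes mirror the outputs shown later in this session, while the helper name `make_example_tree` and the two `.txt` documentation files are hypothetical.

```python
import os

# relative path -> size in bytes, copied from the outputs later in this session
EXAMPLE_DATA_FILES = {
    "file1.dat": 12,
    "file2.dat": 13,
    "sub_1/file3.dat": 65,
    "sub_2/file4.dat": 13,
    "sub_3/file5.dat": 26,
    "sub_2/subsub_1/file6.dat": 13,
    "sub_2/subsub_3/thisisenough/file7.dat": 130,
}

# hypothetical documentation files, only there so the filter has something to skip
EXAMPLE_DOC_FILES = ["README.txt", "sub_1/notes.txt"]


def make_example_tree(root="exciting_data"):
    # create the nested folders and fill each data file with dummy bytes
    for relative_path, size in EXAMPLE_DATA_FILES.items():
        full_path = os.path.join(root, *relative_path.split("/"))
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb") as handle:
            handle.write(b"x" * size)
    # add a few non-data files next to them
    for relative_path in EXAMPLE_DOC_FILES:
        full_path = os.path.join(root, *relative_path.split("/"))
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "w") as handle:
            handle.write("this is documentation, not data\n")
    return root
```

calling `make_example_tree()` from your working directory creates an `exciting_data` folder there; pass a different `root` to put it somewhere else.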

## bonus: 
filter the data files (assumed to have a `.dat` filename ending) and return the list, ordered by **decreasing file size**.

we will start easy. the `os` module allows our python session to interact with the wider world of the operating system outside of it, including the file system.

In [50]:
import os

In [51]:
os.chdir('/Users/crsharp/python_workshop_oskar/python_exercises/exercises/01 intro and setup')

In [52]:
print(os.getcwd()) # get-current-working-directory
whereami = os.getcwd() # save it in a variable.

/Users/crsharp/python_workshop_oskar/python_exercises/exercises/01 intro and setup
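
saving the working directory in `whereami` is handy if you want to come back after an `os.chdir`. a minimal sketch, assuming a hypothetical helper `run_in_directory` (it is not part of `os`), that changes into a directory, runs a function there, and always returns to where it started:

```python
import os


def run_in_directory(directory, func):
    # remember where we are, like `whereami` above
    whereami = os.getcwd()
    os.chdir(directory)
    try:
        # run the given function with `directory` as the working directory
        return func()
    finally:
        # go back even if func() raised an error
        os.chdir(whereami)
```

for example, `run_in_directory("exciting_data", os.listdir)` would list the contents of `exciting_data` without permanently moving your session there.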


In [53]:
def crawl_directory(crawldirectory, file_ending):
    # crawls the given directory tree and returns a list of paths to all files whose names end with file_ending
    data_files = []
    for current_dir, sub_directories, files in os.walk(crawldirectory):
        for filename in files:
            if filename.endswith(file_ending):
                data_files.append(os.path.join(current_dir, filename))
    return data_files


In [55]:
crawl_directory("exciting_data", ".dat")

['exciting_data/file2.dat',
 'exciting_data/file1.dat',
 'exciting_data/sub_1/file3.dat',
 'exciting_data/sub_3/file5.dat',
 'exciting_data/sub_2/file4.dat',
 'exciting_data/sub_2/subsub_3/thisisenough/file7.dat',
 'exciting_data/sub_2/subsub_1/file6.dat']
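
note that the listing above is not alphabetical (`file2.dat` comes before `file1.dat`): `os.walk` hands back entries in whatever order the operating system reports them. if you want a reproducible order, one option is to sort as you go. a sketch, with `crawl_directory_sorted` as a hypothetical variant of the function above:

```python
import os


def crawl_directory_sorted(crawldirectory, file_ending):
    data_files = []
    for current_dir, sub_directories, files in os.walk(crawldirectory):
        # sorting the list in place makes os.walk visit sub folders alphabetically
        sub_directories.sort()
        # sort the file names too, so files within one folder are ordered
        for filename in sorted(files):
            if filename.endswith(file_ending):
                data_files.append(os.path.join(current_dir, filename))
    return data_files
```

the in-place `sub_directories.sort()` only steers the traversal because `os.walk` works top-down by default; a new list assigned to the name would be ignored.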

In [56]:
# sort by file size
def list_files_by_size(filepathlist, decreasing=True):
    sorted_files = sorted(filepathlist, key=os.path.getsize, reverse=decreasing)
    return sorted_files

In [57]:
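# data_files only exists inside crawl_directory, so collect the list again here
data_files = crawl_directory("exciting_data", ".dat")
orderedlist = list_files_by_size(data_files)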

for file in orderedlist:
    print(file, os.path.getsize(file))

exciting_data/sub_2/subsub_3/thisisenough/file7.dat 130
exciting_data/sub_1/file3.dat 65
exciting_data/sub_3/file5.dat 26
exciting_data/file2.dat 13
exciting_data/sub_2/file4.dat 13
exciting_data/sub_2/subsub_1/file6.dat 13
exciting_data/file1.dat 12
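
the three 13-byte files above appear in the same relative order as in the input list, because `sorted` is stable: items with equal keys keep their original order, even with `reverse=True`. if you would rather break ties by path, a tuple key is one way to do it. a sketch, with `list_files_by_size_then_name` as a hypothetical name:

```python
import os


def list_files_by_size_then_name(filepathlist):
    # negative size puts the largest files first; the path breaks ties alphabetically
    return sorted(filepathlist, key=lambda path: (-os.path.getsize(path), path))
```

the size is negated instead of using `reverse=True` so that the paths still sort in ascending order within each group of equal sizes.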


In [58]:
def gimmeorderedfiles(directory, filenameending='.dat'):
    # crawl the directory for files with the given ending and return them in decreasing size
    thelist = crawl_directory(directory, filenameending)
    ordered_files = list_files_by_size(thelist)
    return ordered_files

finalanswer = gimmeorderedfiles('exciting_data', filenameending='.dat')

In [60]:
print(finalanswer)

['exciting_data/sub_2/subsub_3/thisisenough/file7.dat', 'exciting_data/sub_1/file3.dat', 'exciting_data/sub_3/file5.dat', 'exciting_data/file2.dat', 'exciting_data/sub_2/file4.dat', 'exciting_data/sub_2/subsub_1/file6.dat', 'exciting_data/file1.dat']
