# 01 - interacting with the file system

# problem statement
we'll start by describing the problem we want to solve. if you already know all you need to solve it, you can skip the rest of this session! 

imagine you found a nice dataset you need to analyse, but instead of consisting of a single file or a set of conveniently named files sitting together in one directory, the files are scattered about in folders, sub-folders, and sub-sub-folders. there are tens or hundreds of them. also, not all of the files are data files; some of them are documentation files.

you just want a list of the data files so you can iterate over it and process them all in some way. so you now need to create a python function that takes a path to a root directory as its argument, traverses the folder tree, collects all the files in it that have a given filename extension, and returns a list of the files found (path + filename).

## bonus: 
filter the data files (assumed to have a `.dat` filename ending) and return the list, ordered by **decreasing file size**.

In [1]:
import os
import operator

## solution - list all files with given filename ending, ordered by decreasing file size

In [2]:
# set the root directory 
start_here = '/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup'

In [3]:
# provide filename ending
filename_ending = '.dat'

In [4]:
# function that takes a list of file paths, looks up the size of each file (in bytes), and returns
# a list of (path, size) tuples sorted in decreasing order of size

def getKey(item):
    return item[1]

def order_files_by_size(list_of_files):
    list_of_files_and_size = []
    for f in list_of_files:
        file_size_tuple = (f, os.path.getsize(f))
        list_of_files_and_size.append(file_size_tuple)
    sorted_list_of_files = sorted(list_of_files_and_size, key=getKey, reverse=True)
    return sorted_list_of_files
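
the sorting step can also be tried on its own to see what the `key` and `reverse` arguments do. this is only a minimal sketch: the `example_files` list is made up for illustration, and it uses `operator.itemgetter(1)` from the standard library as the sort key, which picks out the second element of each tuple.

```python
import operator

# hypothetical (path, size in bytes) tuples, made up for illustration
example_files = [('a.dat', 13), ('b.dat', 130), ('c.dat', 65)]

# sort by the second element of each tuple (the size), largest first
by_size = sorted(example_files, key=operator.itemgetter(1), reverse=True)
print(by_size)
# → [('b.dat', 130), ('c.dat', 65), ('a.dat', 13)]
```

`sorted` is stable, so files with the same size keep the order in which they were found.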

In [5]:
# function that takes a root directory and file extension, traverses the root directory, retrieves all files in
# the root directory and sub-directories that end with the specified file extension, and then uses the above function
# to order them according to filesize

def retrieve_list_of_files(root_directory, file_extension):
    list_of_files = []
    for current_dir, directories, files in os.walk(root_directory):
        for file in files:
            if file.endswith(file_extension):
                list_of_files.append(os.path.join(current_dir, file))
    ordered_list_of_files = order_files_by_size(list_of_files)
    return ordered_list_of_files
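
for comparison, the same traversal could probably be written with `pathlib` instead of `os.walk`. this is a sketch of an alternative, not the approach used in this session: the function name `find_files_by_size` and the throwaway folder tree are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_files_by_size(root_directory, file_extension):
    """Return (path, size) tuples for matching files, largest first."""
    root = Path(root_directory)
    # rglob searches the root directory and all of its sub-directories
    matches = [p for p in root.rglob('*' + file_extension) if p.is_file()]
    sized = [(str(p), p.stat().st_size) for p in matches]
    return sorted(sized, key=lambda item: item[1], reverse=True)


# build a small throwaway folder tree to try the function on
with tempfile.TemporaryDirectory() as tmp:
    tmp_root = Path(tmp)
    (tmp_root / 'sub_1').mkdir()
    (tmp_root / 'file1.dat').write_text('x' * 12)
    (tmp_root / 'sub_1' / 'file3.dat').write_text('x' * 65)
    (tmp_root / 'sub_1' / 'readme.txt').write_text('documentation')

    found = find_files_by_size(tmp_root, '.dat')
    sizes = [size for _, size in found]
    names = [Path(path).name for path, _ in found]
    print(names, sizes)
```

the documentation file is skipped because its name does not end in `.dat`, and the larger data file comes first.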

In [6]:
# Ta-da!
retrieve_list_of_files(start_here, filename_ending)

[('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/sub_2/subsub_3/thisisenough/file7.dat',
  130),
 ('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/sub_1/file3.dat',
  65),
 ('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/sub_3/file5.dat',
  26),
 ('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/file2.dat',
  13),
 ('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/sub_2/file4.dat',
  13),
 ('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/sub_2/subsub_1/file6.dat',
  13),
 ('/Users/katiea/git/python_workshop/python_exercises/exercises/01 intro and setup/exciting_data/file1.dat',
  12)]
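
since each entry in the result is a `(path, size)` tuple, a processing loop can unpack both values directly. a minimal sketch, assuming a hypothetical `results` list with made-up paths and sizes standing in for the real output:

```python
# hypothetical stand-in for the returned list (made-up paths and sizes)
results = [('data/sub_1/file3.dat', 65), ('data/file1.dat', 12)]

# unpack each tuple into the path and the size in bytes
for path, size in results:
    print(f'{path}: {size} bytes')

# keep only the paths, for code that does not need the sizes
paths = [path for path, _ in results]

# add up the sizes of all files found
total_size = sum(size for _, size in results)
```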