# Python for scientific research
# Working with files and filesystems (file IO)
# Answers to exercises

### Bram Kuijper
### University of Exeter, Penryn Campus, UK
### February 2020

## Exercise 1
Go to the [``os.path``](https://docs.python.org/3/library/os.path.html#module-os.path) page and read through the various methods available. 

### Exercise 1.1
Find the three functions you think are most often used to obtain information about a file or directory.

Which functions matter most is a matter of taste, but I would say:

1. ``os.path.expanduser(path)``: when called with ``path="~"``, returns the user's home directory as a ``str`` object.
2. ``os.path.exists(path)``: returns ``True`` when the file or directory given by ``path`` exists, and ``False`` otherwise.
3. ``os.path.join(path1, path2, path3, ...)``: safely joins multiple path components together. Concatenating strings (``path1 + "/" + path2``) can lead to errors, as you don't know whether ``"\"`` (Windows) or ``"/"`` (Unix) is used as the path separator, or whether ``path1`` already has a trailing ``"/"``, as in ``"/home/foo/bar/"``.
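
A minimal sketch of these three functions in action. The names `projects`, `some_dir` and `data.csv` are hypothetical and only serve as illustration; what `os.path.exists()` reports depends on the files present on your own machine.

```python
import os.path

# expand "~" to the current user's home directory (a str)
home = os.path.expanduser("~")

# join path components using the separator of the current operating system
# ("projects" and "data.csv" are hypothetical names)
data_file = os.path.join(home, "projects", "data.csv")

# a trailing separator on the first component does not lead to a doubled separator
with_trailing = os.path.join("some_dir" + os.sep, "data.csv")
without_trailing = os.path.join("some_dir", "data.csv")

# check whether something exists before trying to open it
print(os.path.exists(home))
print(os.path.exists(data_file))
print(with_trailing == without_trailing)
```

The last comparison shows why ``os.path.join()`` is preferable to string concatenation: both calls give the same path, whether or not the first component ends in a separator.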

### Exercise 1.2

The [``__file__``](https://docs.python.org/3/reference/import.html?highlight=__file__#__file__) variable contains the filename of the current Python script. Write a function called ``current_file_info()`` that accepts a file name as its argument and returns a dictionary containing:
  * the directory name in which your script resides
  * the basename (i.e., ``script.py`` without the top-level directory)
  * the extension of the script (if it is ``script.py`` it should give ``.py``)
  * the creation time of the file (in seconds since January 1, 1970, 00:00:00 -- this sounds more difficult than it actually is)

In [1]:
import os.path


def current_file_info(the_file):
    
    # get the absolute, not the relative path
    the_file = os.path.abspath(the_file)
    
    # directory name
    dirname = os.path.dirname(the_file)
    
    # the filename itself
    basename = os.path.basename(the_file)
    
    # the extension
    ext = os.path.splitext(the_file)[1]
    
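    # the "creation" time; note that on Unix, getctime() returns the time of
    # the last metadata change, whereas on Windows it is the creation time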
    ctime = os.path.getctime(the_file)
    
    return {"dir": dirname, "base": basename, "ext": ext, "ctime": ctime}

In your script, you can use the ``__file__`` variable. However, as I made these exercises in a notebook rather than a conventional Python script, this variable is not available. Instead, I mimic ``__file__`` by joining the current path (``"."``) with an arbitrary file name:

In [2]:
# as the __file__ variable is not available within these
# notebooks (only within scripts), we quickly create a temporary file
__file_mimick = os.path.join(
    os.path.abspath("."),"some_script3834872492.py")

with open(__file_mimick,mode="w") as f:
    f.write("some contents")
    
print(current_file_info(the_file = __file_mimick))

os.remove(__file_mimick)

{'dir': '/home/bram/Projects/4_Teaching/2019_2020/Python/slides/day_2', 'base': 'some_script3834872492.py', 'ext': '.py', 'ctime': 1583232057.5712059}
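
The ``ctime`` value is hard to read as a raw number of seconds. As a small sketch, it can be converted to a calendar date with the standard ``datetime`` module. The value below is copied from the output above; using UTC is an assumption made here so that the result does not depend on the local time zone.

```python
from datetime import datetime, timezone

# the 'ctime' value from the output above
# (seconds since January 1, 1970, 00:00:00 UTC)
ctime = 1583232057.5712059

# convert to a timezone-aware datetime object; UTC is chosen for reproducibility
ctime_utc = datetime.fromtimestamp(ctime, tz=timezone.utc)

print(ctime_utc.strftime("%Y-%m-%d %H:%M:%S"))
```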


## Exercise 2
Getting raw text files into an object that is amenable to data analysis can sometimes be a tedious task. For most comma-separated files, the function [``pandas.read_csv()``](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) from the [pandas](https://pandas.pydata.org/) library does the job. However, in many other cases, the raw data contains descriptions, whitespace, irregular columns and other nuisances, which means that some preprocessing is necessary. Here we focus on such a case:

### Exercise 2.1
Inspect the [iso_8859-1.txt](https://www.w3.org/TR/PNG/iso_8859-1.txt) file by clicking on this [link](https://www.w3.org/TR/PNG/iso_8859-1.txt). Save the file somewhere in your home directory. If the file opens in another browser tab or window, press Ctrl + S to save it to disk. The file contains the hexadecimal character codes for the character set ISO-8859-1, but that is not important here. What we want is to get **all the hexadecimal numbers in a single column, followed by their descriptions**. At the moment, however, there are sometimes one, sometimes two columns, and the whole thing is preceded by column names and a description, which need to be removed.

### Exercise 2.2
To start, write a function ``filter_hex_codes(pathname)`` that opens the file and returns a list containing only those lines that hold either one or two hex codes and their descriptions. The description at the start of the file, as well as any whitespace at the start or end of each line (not in the middle), should be removed. Hence, the first element of the list returned by this function should be ``'20  SPACE'``. The second element should be ``'21  EXCLAMATION MARK     A1  INVERTED EXCLAMATION MARK'`` and the last element should be ``'FF  SMALL LETTER Y WITH DIAERESIS'``. 

Hint: do not try to cram every possible pattern into a regex (although regexes will be needed). Look at other [text processing functions](https://docs.python.org/3/library/stdtypes.html?highlight=str%20split#string-methods) with which you can reduce the complexity of the problem.
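
To illustrate the hint, here is a small sketch of how plain string methods already simplify a line before any regex is involved. The sample line is written in the style of the data file, and the sanity check at the end is only one possible approach, not part of the exercise.

```python
# a sample line in the style of the data file, with surrounding whitespace
raw_line = "   21  EXCLAMATION MARK            A1  INVERTED EXCLAMATION MARK  \n"

# str.strip() removes leading and trailing whitespace, including the newline
stripped = raw_line.strip()

# str.split() with maxsplit=1 separates the first hex code from the rest
first_code, remainder = stripped.split(maxsplit=1)

# len() and str.isalnum() give a cheap sanity check on the first field
looks_like_code = len(first_code) == 2 and first_code.isalnum()

print(stripped)
print(first_code, looks_like_code)
```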

In [3]:
import re

def filter_hex_codes(filename):
    
    filtered_list = []
       
    with open(file=filename, mode="r") as f_obj:

        for line in f_obj:
            
            # strip whitespace
            # this immediately removes hassle with the last line
            # as that is now 'FF  SMALL LETTER Y WITH DIAERESIS'
            stripped_line = line.strip()
            
            # match a line starting with a hexadecimal character
            if re.search(pattern="^[0-9A-F]", string=stripped_line) is not None:
                filtered_list.append(stripped_line)
                
    return(filtered_list)
                    
file_name = "iso_8859-1.txt"
hex_list = filter_hex_codes(filename=file_name)

# output in notebook:
[i for i in hex_list]

['20  SPACE',
 '21  EXCLAMATION MARK            A1  INVERTED EXCLAMATION MARK',
 '22  QUOTATION MARK              A2  CENT SIGN',
 '23  NUMBER SIGN                 A3  POUND SIGN',
 '24  DOLLAR SIGN                 A4  CURRENCY SIGN',
 '25  PERCENT SIGN                A5  YEN SIGN',
 '26  AMPERSAND                   A6  BROKEN BAR',
 '27  APOSTROPHE                  A7  SECTION SIGN',
 '28  LEFT PARENTHESIS            A8  DIAERESIS',
 '29  RIGHT PARENTHESIS           A9  COPYRIGHT SIGN',
 '2A  ASTERISK                    AA  FEMININE ORDINAL INDICATOR',
 '2B  PLUS SIGN                   AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK',
 '2C  COMMA                       AC  NOT SIGN',
 '2D  HYPHEN-MINUS                AD  SOFT HYPHEN',
 '2E  FULL STOP                   AE  REGISTERED SIGN',
 '2F  SOLIDUS                     AF  OVERLINE',
 '30  DIGIT ZERO                  B0  DEGREE SIGN',
 '31  DIGIT ONE                   B1  PLUS-MINUS SIGN',
 '32  DIGIT TWO                   B2  SUPERS

### Exercise 2.3
Make another function that accepts the list returned by ``filter_hex_codes(pathname)`` and prints the two columns of  
``hex code  character description      hex code  character description``   
as a single ``;``-separated list of  
``hex code;character description ``  

The order of the hex code in the list is not important.

A potential output could be:

``['20;SPACE',
 '21;EXCLAMATION MARK',
 'A1;INVERTED EXCLAMATION MARK',
 '22;QUOTATION MARK',
 'A2;CENT SIGN',
 '23;NUMBER SIGN',
 'A3;POUND SIGN',
 ...
 '7E;TILDE',
 'FE;SMALL LETTER THORN (Icelandic)',
 'FF;SMALL LETTER Y WITH DIAERESIS']``

**Hint:** check out ``re.split()`` if you want to split a line based on a slightly more complicated pattern than possible with ``str.split()``
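
The difference between the two functions can be seen in a short sketch on a single line from the file. Splitting on two or more whitespace characters is one possible pattern choice; others may work equally well.

```python
import re

line = "21  EXCLAMATION MARK            A1  INVERTED EXCLAMATION MARK"

# str.split() splits on every run of whitespace, so it also breaks up
# the multi-word descriptions
naive = line.split()

# re.split() on two or more whitespace characters keeps the single
# spaces inside the descriptions intact
fields = re.split(r"\s{2,}", line)

print(naive)
print(fields)
```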

In [4]:
def single_column_list_unordered(multicolumn_list):
    
    # make a new list for the new elements which 
    # reflect a single column only
    single_col_list = []
    
    # go through each line
    for list_elmt in multicolumn_list:
        
        # split the list according to the occurrence of 2 or more whitespaces
        # one whitespace is not enough as you then also split your descriptions (a mess...)
        split_list = re.split(pattern=r"\s{2,}",string=list_elmt)
        
        # always at least a single hit
        single_col_list.append(split_list[0:2])
           
        # but potentially another hit too
        if len(split_list) == 4:
            single_col_list.append(split_list[2:4])
       
    # add semicolons using a list comprehension
    csv_list = [";".join(i) for i in single_col_list]
    return(csv_list)
        
# call the function
result_list = single_column_list_unordered(multicolumn_list=hex_list)

[i for i in result_list]

['20;SPACE',
 '21;EXCLAMATION MARK',
 'A1;INVERTED EXCLAMATION MARK',
 '22;QUOTATION MARK',
 'A2;CENT SIGN',
 '23;NUMBER SIGN',
 'A3;POUND SIGN',
 '24;DOLLAR SIGN',
 'A4;CURRENCY SIGN',
 '25;PERCENT SIGN',
 'A5;YEN SIGN',
 '26;AMPERSAND',
 'A6;BROKEN BAR',
 '27;APOSTROPHE',
 'A7;SECTION SIGN',
 '28;LEFT PARENTHESIS',
 'A8;DIAERESIS',
 '29;RIGHT PARENTHESIS',
 'A9;COPYRIGHT SIGN',
 '2A;ASTERISK',
 'AA;FEMININE ORDINAL INDICATOR',
 '2B;PLUS SIGN',
 'AB;LEFT-POINTING DOUBLE ANGLE QUOTATION MARK',
 '2C;COMMA',
 'AC;NOT SIGN',
 '2D;HYPHEN-MINUS',
 'AD;SOFT HYPHEN',
 '2E;FULL STOP',
 'AE;REGISTERED SIGN',
 '2F;SOLIDUS',
 'AF;OVERLINE',
 '30;DIGIT ZERO',
 'B0;DEGREE SIGN',
 '31;DIGIT ONE',
 'B1;PLUS-MINUS SIGN',
 '32;DIGIT TWO',
 'B2;SUPERSCRIPT TWO',
 '33;DIGIT THREE',
 'B3;SUPERSCRIPT THREE',
 '34;DIGIT FOUR',
 'B4;ACUTE ACCENT',
 '35;DIGIT FIVE',
 'B5;MICRO SIGN',
 '36;DIGIT SIX',
 'B6;PILCROW SIGN',
 '37;DIGIT SEVEN',
 'B7;MIDDLE DOT',
 '38;DIGIT EIGHT',
 'B8;CEDILLA',
 '39;DIGIT NINE',
 

### Exercise 2.4
Write the contents to a file ``iso_8859-1-single-columnn.txt``

In [5]:
# join the list elements into a single newline-separated string
single_str = "\n".join(result_list)

with open(file="iso_8859-1-single-columnn.txt",mode="w") as f_obj:
    f_obj.write(single_str)
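
As a quick check that such a file can be read back, the sketch below writes a few sample entries to a temporary directory and parses them with the standard ``csv`` module. The sample entries and the file name ``sample-single-column.txt`` are hypothetical stand-ins for ``result_list`` and the real output file.

```python
import csv
import os
import tempfile

# a few entries in the same "hex code;description" format as result_list
sample_list = ["20;SPACE", "21;EXCLAMATION MARK", "A1;INVERTED EXCLAMATION MARK"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name, written to a temporary directory
    out_path = os.path.join(tmp_dir, "sample-single-column.txt")

    with open(out_path, mode="w", newline="") as f_obj:
        f_obj.write("\n".join(sample_list))

    # read the file back, splitting each line on the semicolon
    with open(out_path, mode="r", newline="") as f_obj:
        rows = list(csv.reader(f_obj, delimiter=";"))

print(rows)
```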

## Exercise 3
Bringing data together from multiple files is another task for which Python is commonly used. Here we work on a directory containing a range of different files and try to combine the data they contain.

### Exercise 3.1
Download and unpack the zip file from the following [link](https://github.com/bramkuijper/stress/raw/master/figs/compare_dp/some_data.zip). Move it to a folder somewhere in your home directory (we are not going to delete anything, so no worries) and inspect the directory structure. It contains different files distributed over a number of subdirectories. Using Notepad to open a file may be less than helpful, but one can always use the menu option ``File``>``Open`` in Spyder to inspect a raw data file. Just make sure that the `All files` option is selected in the `Select file` dialog.

### Exercise 3.2
Write a function ``files_matching_regex(pattern, path)`` which uses the [``os.walk()``](https://docs.python.org/3/library/os.html#os.walk) function to loop over all files in the unpacked zip file above. It returns a list with the complete pathnames of those files whose names match the regular expression given by ``pattern``. The location of the top-level directory containing the data should be provided by the keyword argument ``path``.

Then test the function by specifying (i) a pattern which matches all the files ending in `.csv` and (ii) another pattern that matches all the files that contain `02_2020` but neither start with `graph` nor end in `*_iters.csv`.
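
For those unfamiliar with ``os.walk()``, the following sketch shows what it yields. It builds a tiny directory tree in a temporary directory; the names ``a.txt``, ``sub`` and ``b.csv`` are made up for this illustration and are unrelated to the zip file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # build a tiny, hypothetical directory tree: top/a.txt and top/sub/b.csv
    os.makedirs(os.path.join(top, "sub"))
    for rel_path in ("a.txt", os.path.join("sub", "b.csv")):
        with open(os.path.join(top, rel_path), mode="w") as f_obj:
            f_obj.write("dummy")

    found = []
    # os.walk() yields one (root, dirs, files) tuple per directory visited
    for root, dirs, files in os.walk(top):
        for file_name in files:
            # store the path relative to the top-level directory
            full_path = os.path.join(root, file_name)
            found.append(os.path.relpath(full_path, start=top))

print(sorted(found))
```

Note that ``files`` only holds bare file names, which is why they have to be joined with ``root`` to obtain a usable path.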

In [13]:
def files_matching_regex(pattern, path):
    
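    # return an empty list (rather than None) when the path does not exist,
    # so that callers can always iterate over the result
    if not os.path.exists(path):
        return []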
    
    matching_files = []
    
    for root, dirs, files in os.walk(top=path):
        
        # files is a list of files in the current directory
        for file in files:
            if re.search(pattern=pattern, string=file) is not None:
                matching_files.append(os.path.join(root,file))
                
    return(matching_files)

the_path = "/home/bram/Projects/stress/figs/compare_dp/some_data"

#### Matching CSV files:

In [14]:
# the regex
csv_pattern = r"\.csv$"

# call the function and return a list
csv_matches = files_matching_regex(pattern=csv_pattern, path=the_path)

# just for output purposes:
[match_i for match_i in csv_matches]

['/home/bram/Projects/stress/figs/compare_dp/some_data/dir5/sim_stress_17_02_2020_152706_3iters.csv',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir4/sim_stress_17_02_2020_152706_5iters.csv',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir3/sim_stress_17_02_2020_152706_4iters.csv',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir3/dir3a/sim_stress_14_02_2020_232531_3iters.csv',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir1/sim_stress_17_02_2020_152706_2iters.csv',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir2/sim_stress_17_02_2020_152622_1iters.csv']

#### Matching files that contain `02_2020` but neither start with `graph` nor end in `*_iters.csv`

In [15]:
# raw string, so that the backslash in \d is passed on to the regex unchanged
feb2020_pattern = r"^sim.*02_2020.*\d$"

feb2020_matches = files_matching_regex(pattern=feb2020_pattern, path=the_path)

[match_i for match_i in feb2020_matches]

['/home/bram/Projects/stress/figs/compare_dp/some_data/sim_stress_17_02_2020_182706_4',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir5/sim_stress_17_02_2020_152706_3',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir4/sim_stress_17_02_2020_152706_5',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir3/sim_stress_17_02_2020_152706_4',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir3/dir3a/sim_stress_14_02_2020_232531_3',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir1/sim_stress_17_02_2020_152706_2',
 '/home/bram/Projects/stress/figs/compare_dp/some_data/dir2/sim_stress_17_02_2020_152622_1']
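
To see why this pattern does the job, it helps to try it on a few individual file names. In the sketch below, the first two names are taken from the listings above, whereas the third (starting with `graph`) is a hypothetical example of a name that should be rejected.

```python
import re

# raw string, so that \d reaches the regex engine unchanged
feb2020_pattern = r"^sim.*02_2020.*\d$"

# the first two names occur in the listings above; the third is hypothetical
file_names = [
    "sim_stress_17_02_2020_152706_3",
    "sim_stress_17_02_2020_152706_3iters.csv",
    "graph_sim_stress_17_02_2020_152706_3",
]

# map each file name to True (match) or False (no match)
results = {
    file_name: re.search(feb2020_pattern, file_name) is not None
    for file_name in file_names
}

for file_name, matched in results.items():
    print(file_name, matched)
```

The ``^sim`` anchor rejects names starting with `graph`, while the trailing ``\d$`` rejects names ending in `iters.csv`.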

### Exercise 3.3
Now write another function ``match_line(path, pattern)`` that opens the file given by the keyword argument ``path`` and returns the first line matching the keyword argument ``pattern``. Use this function in combination with the ``files_matching_regex(pattern, path)`` function above to get a list of all the lines starting with `sP2NP` from all files whose names start with `sim` and end in one of the digits `1`, `3` or `5`. The listing should be (not necessarily in this order):

``['sP2NP_1;0.19;', 'sP2NP_1;0.19;', 'sP2NP_1;0.5;', 'sP2NP_1;0.89;']``

In [16]:
def match_line(path, pattern):
    
    with open(path) as f_obj:
        
        for line in f_obj:
            if re.search(pattern=pattern, string=line) is not None:
                return(line.strip())

In [17]:
sim_last_digit_pattern = "^sim.*[135]$"

matches =  files_matching_regex(pattern=sim_last_digit_pattern, path=the_path)

# print all the matching files
print("\n".join(matches))

# now recover the matching lines
param_list = [match_line(path=path_i,pattern="sP2NP") for path_i in matches]

print(param_list)

/home/bram/Projects/stress/figs/compare_dp/some_data/dir5/sim_stress_17_02_2020_152706_3
/home/bram/Projects/stress/figs/compare_dp/some_data/dir4/sim_stress_17_02_2020_152706_5
/home/bram/Projects/stress/figs/compare_dp/some_data/dir3/dir3a/sim_stress_14_02_2020_232531_3
/home/bram/Projects/stress/figs/compare_dp/some_data/dir2/sim_stress_17_02_2020_152622_1
['sP2NP_1;0.19;', 'sP2NP_1;0.5;', 'sP2NP_1;0.89;', 'sP2NP_1;0.19;']
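
As a possible next step towards combining the data, the parameter values can be extracted from these lines as numbers. This is only a sketch: it assumes every line has the form ``name;value;`` as in the output above, and it would fail if ``match_line()`` had returned ``None`` for a file without a matching line.

```python
# the list printed above
param_list = ['sP2NP_1;0.19;', 'sP2NP_1;0.5;', 'sP2NP_1;0.89;', 'sP2NP_1;0.19;']

values = []
for line in param_list:
    # each line has the form "name;value;", so the first two fields suffice
    name, value = line.split(";")[0:2]
    values.append(float(value))

print(values)
```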
