# Working With Files


## The pathlib module
The pathlib module is documented at https://docs.python.org/3/library/pathlib.html 
The module documentaion states:

"If you’ve never used this module before or just aren’t sure which class is right for your task, Path is most likely what you need. It instantiates a concrete path for the platform the code is running on."

Path lets you separate the folders and files in a path with a forward slash, which then works on both windows and other operating systems like Mac or Linux that use forward slashes.

In [4]:
from pathlib import Path

dataDirectory = r'/Users/cagilalbayrak/LAS792/data'

DataFolder = Path(r'/Users/cagilalbayrak/LAS792/data')
#DataFolder = Path(r'C:\Users\zambrana\OneDrive - The University of Kansas\LAS792_Fall2021_ForStudents')
print("as a string the path is: ", str(DataFolder))

str(DataFolder)


as a string the path is:  /Users/cagilalbayrak/LAS792/data


'/Users/cagilalbayrak/LAS792/data'

### Concatening Path elements

The Path method can concatenate arguments into a single path

The / operator can do the same **for Path objects**

In [7]:
newPath = Path(DataFolder,"firstDay")
print(newPath)

alternative = DataFolder / "firstDay"
print(alternative)

/Users/cagilalbayrak/LAS792/data/firstDay
/Users/cagilalbayrak/LAS792/data/firstDay


### Current home directory

In [8]:
Path.home()

PosixPath('/Users/cagilalbayrak')

### Current working directory

In [14]:
Path.cwd()

PosixPath('/Users/cagilalbayrak/LAS792')

### Making a new folder
The Path.mkdir() method makes a new folder. It returns an error it the folder already exists.

In [10]:

otherPath = Path(r'/Users/cagilalbayrak/LAS792/data')

newFolderPath = Path.home() / "testFolder"
Path(newFolderPath).mkdir()




FileExistsError: [Errno 17] File exists: '/Users/cagilalbayrak/testFolder'

### Catching errors

Sometimes you want to catch and error and do something with it and then proceed instead of halting.
The try ... except lets you do that.

You should "catch" only the specific error(s) you might expect. Fid them by generating the error (like the "*FileExistsError*" above).

Try changing the folder name below and then running it twice.

In [12]:
try:
    newFolderPath2 = Path.home() / "testFolder2"
    Path(newFolderPath2).mkdir()
except FileExistsError:
    print(newFolderPath2, " already exists ... cool")

/Users/cagilalbayrak/testFolder2  already exists ... cool


### Path parts
Path has methods to extract parts of a path. The example below shows which parts of the path are extracted by these properties of the path object.

- pathObject.anchor
- pathObject.parent
- pathObject.name
- pathObject.stem
- pathObject.suffix

In [13]:

newFilePath = newFolderPath / "junk.txt"
print("For the path: ", newFilePath)

print("\nanchor: ", newFilePath.anchor, " ;  type: ",  type(newFolderPath.anchor))
print("parent: ", newFilePath.parent, " ;  type: ",   type(newFolderPath.parent))
print("name: ", newFilePath.name, " ;  type: ",   type(newFolderPath.name))
print("stem: ", newFilePath.stem, " ;  type: ",   type(newFolderPath.stem))
print("suffix: ", newFilePath.suffix, " ;  type: ",   type(newFolderPath.suffix))

For the path:  /Users/cagilalbayrak/testFolder/junk.txt

anchor:  /  ;  type:  <class 'str'>
parent:  /Users/cagilalbayrak/testFolder  ;  type:  <class 'pathlib.PosixPath'>
name:  junk.txt  ;  type:  <class 'str'>
stem:  junk  ;  type:  <class 'str'>
suffix:  .txt  ;  type:  <class 'str'>


### Does it exist, and if so what kind of beast is it?

In [15]:
print(newFilePath.exists())
print(newFolderPath.is_file())
print(newFolderPath.is_dir())

False
False
True


### Writing to a new file
You can open a path to a file that does not exist. Python will create the file. see https://docs.python.org/3/library/functions.html#open 
The second argument states how to open the file:

- 'r' open for reading (default)

- 'w' open for writing, truncating the file first

- 'x' - open for exclusive creation, failing if the file already exists

- 'a' - open for writing, appending to the end of the file if it exists

- 'b' - binary mode

- 't' - text mode (default)

- '+' - open for updating (reading and writing)


In the code that follows the new file will contain:

this is the first line.  this more on that line.
<br>this is the second line. 


In [20]:
testFile = open(newFilePath, 'w')
print("writing to file ", newFilePath)
charsWritten = 0

#In Python "X += Y" is the same as "X = X + Y"
#in stata, ++'i'
# i++ means use the value of i hen increase by i

charsWritten  += testFile.write("this is the first line. ")
charsWritten  += testFile.write(" this more on that line.")
charsWritten  += testFile.write("\nthis is the second line. ") #note the \n that starts the new line

print("The three writes wrote ", charsWritten, " characters")
testFile.close()

writing to file  /Users/cagilalbayrak/testFolder/junk.txt
The three writes wrote  74  characters


In [21]:
with open(newFilePath, 'a') as f:  #append is a write is w
    f.write("\nSome other line")

### trying to write without permission

In [22]:
# open in default text mode and try to write
testFile = open(newFilePath)
charsWritten = 0

charsWritten  += testFile.write("written in text mode. ")

print("The three writes wrote ", charsWritten, " characters")

testFile.close()

UnsupportedOperation: not writable

## Archiving Python data objects
Native Python objects (like a dict)  can be archived as the text that creates them. The pprint.pformat method produces the text that would recreate a Python object. It does not do so for a Pandas DataFrame.
The DataFrame can be expressed as a dictionary that would recreate it when called as the argument to the pd.DataFrame method.

In [71]:
import pandas as pd
import pprint
myDataAsDict = {'Name':["John", "Jane", "Emma"],
                    'Age':[21,20,30]}

myDf = pd.DataFrame(myDataAsDict, index=['zero', 'one', 'two'])

pprint.pprint(myDf)

myDfAsText = pprint.pformat(myDf.to_dict())
print(myDfAsText)


# recreate the DataFrame by evaluating the string
myDfFromDict = pd.DataFrame(eval(myDfAsText))
myDfFromDict

      Name  Age
zero  John   21
one   Jane   20
two   Emma   30
{'Age': {'one': 20, 'two': 30, 'zero': 21},
 'Name': {'one': 'Jane', 'two': 'Emma', 'zero': 'John'}}


Unnamed: 0,Age,Name
one,20,Jane
two,30,Emma
zero,21,John


### writing a module
The function below makes the DataFrame statement that will recreate a DataFrame. 
It could be written to a .py file.  Importing that file makes the DataFrame a 
property of the module.

In [72]:
import pandas as pd

def DfToText(dFname, dFrame):
    # converts a dataframe to text code that would recreate the data frame
    DfAsDictText = pprint.pformat(myDf.to_dict())
    return dFname + " = pd.DataFrame(" + DfAsDictText + ")"
print(DfToText("myNewDf", myDf))

dFSaveFile = open("dFcode.py", 'w')

dFSaveFile.write("import pandas as pd\n")
dFSaveFile.write(DfToText("myNewDf", myDf))
dFSaveFile.write("\n")

dFSaveFile.close()

# The objects created in the file dFcode.py 
# become properties of the module dFcode
import dFcode
# This is the myNewDf property
dFcode.myNewDf

myNewDf = pd.DataFrame({'Age': {'one': 20, 'two': 30, 'zero': 21},
 'Name': {'one': 'Jane', 'two': 'Emma', 'zero': 'John'}})


Unnamed: 0,Age,Name
one,20,Jane
two,30,Emma
zero,21,John


In [None]:
## this is revised try this as well.

import pandas as pd
myDf2 = myDf.copy()
myDf2['foo'] = 1 
print (myDf2)

def DfToText(dFname, dFrame):
    # converts a dataframe to text code that would recreate the data frame
    DfAsDictText = pprint.pformat(dFrame.to_dict())
    return dFname + " = pd.DataFrame(" + DfAsDictText + ")"
print(DfToText("myNewDf", myDf2))

dFSaveFile = open("dFcode.py", 'w')

dFSaveFile.write("import pandas as pd\n")
dFSaveFile.write(DfToText("myNewDf", myDf))
dFSaveFile.write("\n")

dFSaveFile.close()

# The objects created in the file dFcode.py 
# become properties of the module dFcode
import dFcode
# This is the myNewDf property
dFcode.myNewDf

## Traversing a folder with os.walk()
The os.walk() function from the os module https://docs.python.org/3/library/os.html  traverses a given folder and returns all of the folders and files in that older and all of its subfolders.



### Listing folder contents
This code prints a list of the contents of each folder in the tree

In [73]:
# a simple listing of names
import os
from pathlib import Path

newPath = Path.cwd()

for targetFolder, folders, files in os.walk(str(newPath)):
    print('\n', newPath)
    print('Folders: ')
    for folder in folders:
        print(folder)
    print('\nFiles: ')
    for file in files:
        print(file)                                            


 C:\Users\zambrana\OneDrive - The University of Kansas\LAS792_Fall2021\JupyterNotebooks
Folders: 
.ipynb_checkpoints
__pycache__

Files: 
Characters.ipynb
DataFrameIndexes.ipynb
dFcode.py
Examples2021_02_18.ipynb
Examples2021_09_07.ipynb
Exercises2021_02_11.ipynb
fixedHD5.h5
IteratingDataFrames.ipynb
JSON.ipynb
Merging_Tables_with_Pandas-Copy1.ipynb
Merging_Tables_with_Pandas.ipynb
PythonBasics.ipynb
PythonBasics_Exercises_08_26_2021.ipynb
PythonFunctions.ipynb
PythonPlots.ipynb
Reading1.ipynb
Reading2.ipynb
RegularExpressionsInPython.ipynb
RestructuringData.ipynb
RestructuringData_20210921.ipynb
SentinelAndMissing.ipynb
SQLCreateInsertEtc.ipynb
SQLjoin.ipynb
SQLselect.ipynb
SQLsubqueriesAndSets.ipynb
stubs.ipynb
Transforming.ipynb
Transforming_20210914.ipynb
WorkingWithFiles.ipynb
XML.ipynb

 C:\Users\zambrana\OneDrive - The University of Kansas\LAS792_Fall2021\JupyterNotebooks
Folders: 

Files: 
Characters-checkpoint.ipynb
DataFrameIndexes-checkpoint.ipynb
Examples2021_02_18-checkpo

### Listing files as paths
This code only lists files, returning the full path of the file.

In [74]:
# make one list of all files including in all subfolders as full paths
import os
import pprint

fileList = []

for targetFolder, folders, files in os.walk(newPath):
    fileList += [os.path.join(targetFolder, file) for file in files]
pprint.pprint(fileList)
    

['C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\Characters.ipynb',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\DataFrameIndexes.ipynb',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\dFcode.py',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\Examples2021_02_18.ipynb',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\Examples2021_09_07.ipynb',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\Exercises2021_02_11.ipynb',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\fixedHD5.h5',
 'C:\\Users\\zambrana\\OneDrive - The University of '
 'Kansas\\LAS792_Fall2021\\JupyterNotebooks\\IteratingDataFrames.ipynb',
 'C:\\Users\\zambrana\\OneDr

### A function listing files with a given extension
The function shown here uses a list comprehension having a regular expresion clause limiting the list to just files with the provided extension.

In [75]:
# function to list all files with a given extension
import os
import pprint
import re

def listFiles(targetPath, extension):
    fileList = []

    for targetFolder, folders, files in os.walk(newPath):
        fileList += [os.path.join(targetFolder, fileName) for fileName in files if re.search(r'\.'+extension+'$',
                                                                                             fileName) != None]
    return fileList

foundFiles = listFiles(Path(DataFolder,'data/firstDay'), 'txt')

pprint.pprint(foundFiles)

[]


### Computing checksums (hashing file contents)
We'll use hashlib for this to write code to hash the contents of files. There are higher level modules that will do this for us, but they are not part of the Anaconda distribution that we have installed.


example adapted from https://www.pythoncentral.io/hashing-files-with-python/

The code below reads the whole file into memory. This might blow up for very large files. A small modification to the code shown in the web page referenced above would read the file in blocks, allowing this to work for very large files.

In [76]:
import hashlib
import pprint 

def getMd5(filePath):
    hasher = hashlib.md5()
    with open(filePath, 'rb') as afile:
        buf = afile.read()
        hasher.update(buf)
    return hasher.hexdigest()

tupleList = list(zip([getMd5(aFilePath) for aFilePath in foundFiles],
                      foundFiles, ))
print(type(tupleList))

sorted(tupleList)
pprint.pprint(tupleList[0:2])

<class 'list'>
[]
