# Synopsis

In this unit we will learn an important skill: how to read and write files. In order to do that, we will cover:

1. How files are organized on a computer and how to view/find file locations with terminal commands from the notebook
2. How to open a file and read it
3. How to open a file and write text to it. 

# Read libraries

In [None]:
from IPython.display import HTML, YouTubeVideo, display

import datetime

# Videos

In [None]:
vid = YouTubeVideo('pTT7HMqDnJw', width = 600)
display(vid)


# Filesystems

Typing strings and numbers into a `Jupyter` Notebook is a great way to learn the basics.

However, sooner or later you will have to learn how to read data from a file, perform some analyses on that data, and (ideally) save results of the analyses. 

In order to do this effectively, we first must go over the basics of `filesystems` so you understand how data is organized in your storage drive.

**Mac OS** `filesystems` are organized in a similar manner to **Unix** `filesystems`. 

> It all starts at `/` -- the root!  

The root location holds several folders, as shown below by the names in blue

<img src = 'Images/dir_list.png' width = '700px'></img>

These folders can be seen as branches of a bush with root at `/`.  Of particular interest is the `Users` branch which stores the home accounts.  

My home account is named `amaral` and it is located 2 levels below the root `/`.

<img src = 'Images/more_dir_lists.png' width = '700px'></img>

The location of a file or folder can be specified in an absolute or relative manner using a string that reports the intermediate branches between the file and the root or some relative location, respectively. The intermediate branches are separated by `/`.

The absolute path for the file `update.zip` is

> /Users/amaral/update.zip

The relative path with regards to the folder `Users` is

> amaral/update.zip


On **Windows**, things are not that different. There is still a root which is located at `C:\` instead of `/` and the separators are `\` instead of `/`.
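
The absolute/relative distinction can be sketched with the `PurePosixPath` and `PureWindowsPath` classes from the standard-library `pathlib` module. These classes only manipulate path strings and never touch the disk, so the sketch should behave the same on any operating system. The Windows path is a hypothetical counterpart of the example above, not a file from the course materials.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The absolute path from the example above
absolute_path = PurePosixPath('/Users/amaral/update.zip')

# The same file, described relative to the folder /Users
relative_path = absolute_path.relative_to('/Users')

print(absolute_path.is_absolute())   # True
print(relative_path)                 # amaral/update.zip
print(relative_path.is_absolute())   # False

# Hypothetical Windows counterpart: the root is C:\ and the separator is \
windows_path = PureWindowsPath(r'C:\Users\amaral\update.zip')
print(windows_path.anchor)           # C:\
print(windows_path.is_absolute())    # True
```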



In the image above, you can see that a lot of information is provided for each item in a folder. For example,

<img src = 'Images/file_info.png' width = '700px'></img>

> The first column of codes contains information about the **privileges/permissions** of different types of accounts. 
>
> The second column shows the number of hard links to the item; for a folder such as `share`, this depends on the number of items inside it (don't worry about it).
>
> The third column shows the **owner** of the file or folder. 
>
> The fourth column shows the name of the **group** with special permissions for the file or folder.
>
> The fifth column shows the **size** of the file or folder.
>
> The sixth column shows **the last date and time** (if in the same calendar year) the file was modified. 
>
> The seventh column shows the **name** of the file or folder.


## `os`, `pwd`, and `grp` $-$ libraries for interacting with the filesystem

We can read more in the manual pages for:

* [`os`](https://docs.python.org/3/library/os.html)

* [`pwd`](https://docs.python.org/3/library/pwd.html)

* [`grp`](https://docs.python.org/3/library/grp.html)

In [None]:
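# No installation is needed: `pwd` and `grp` are part of Python's standard
# library on Mac and Linux (they are not available on Windows)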

In [None]:
import os

# Comment the lines below if on Windows since these are not available
import pwd
import grp


In [None]:
help(os)

In [None]:
print( os.name )
print()

# Comment the command below if on Windows.
#
print( os.uname() )

In all likelihood, you are getting different outputs from me when you run that cell.  That is because we have different computers that are running different operating systems.
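
Rather than commenting lines in and out by hand, the operating-system differences can be checked in code. The sketch below is one possible way to do that, using only standard-library calls; the variable names `system_name` and `has_unix_accounts` are made up for this illustration.

```python
import os
import platform

# os.name is coarse: 'posix' on Mac/Linux, 'nt' on Windows
print(os.name)

# platform.system() is more specific: 'Darwin', 'Linux', or 'Windows'
system_name = platform.system()
print(system_name)

# os.uname only exists on Unix-like systems, so check before calling it
if hasattr(os, 'uname'):
    print(os.uname())

# pwd and grp are also Unix-only; fall back to None where they are missing
try:
    import pwd
    import grp
except ImportError:
    pwd = None
    grp = None

has_unix_accounts = pwd is not None and grp is not None
print(has_unix_accounts)
```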

When writing code, it is important to know the current working directory (or folder), that is, the location from which the computer resolves relative file paths.

In [None]:
help(os.getcwd)
print()

our_working_directory = os.getcwd()
print(our_working_directory)

<br>

<br>

<br>


As before, you are likely getting something quite different from me.  

ASIDE: *Many many years ago, when Windows did not yet exist and Microsoft's OS was named DOS, Microsoft used `/` to indicate flag values. For this reason, they were 'forced' to use `\` for separating folders.*

The cool thing is that, because you downloaded the course files into their correct places, everything below the parent of the working directory will be in the same relative location on your machine as on mine.



In [None]:
help(os.listdir)


In [None]:
sorted(os.listdir('/Users/'))

The output of `os.listdir` is a list, so we can sort it or access its elements. The shell command `ls` shows the same folder with more detail:

In [None]:
ls -lht /Users/

<br>

The default input for `.listdir` is the current working directory.

In [None]:
files = sorted( os.listdir() )
files

In [None]:
ls -lht 

In [None]:
print(files[-1])
print()

properties = os.stat(files[-1])
print('---', properties)
print()
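
To see `os.stat` on a file whose properties are known in advance, the sketch below writes a small temporary file and inspects it. The file name `hello.txt` is made up for the illustration, and `st_mtime` (time of last modification) is used because it has the same meaning on every operating system.

```python
import datetime
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'hello.txt')
    with open(path, 'w') as handle:
        handle.write('hello world')      # 11 ASCII characters -> 11 bytes

    info = os.stat(path)
    size = info.st_size
    # fromtimestamp converts seconds since Jan 1st, 1970 into local time
    modified = datetime.datetime.fromtimestamp(info.st_mtime)

print(size)                              # 11
print(modified.strftime('%b %d, %Y, %H:%M'))
```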

In [None]:
print(f"The size of my file is {properties.st_size} B\n")

print(f"The user id of the file's owner is {properties.st_uid}.  ")
#       f"Their username is {pwd.getpwuid(properties.st_uid).pw_name}.\n")

print(f"The group id of the file is {properties.st_gid}.  ")
#       f"The group name is {grp.getgrgid(properties.st_gid).gr_name}.\n")

print(f"There are {properties.st_nlink} hard links connected to this file."
      f"  Unless this is a folder, the number should be 1.\n")

# st_ctime is the creation time on Windows, but on Unix it is the time of
# the last metadata change; fromtimestamp converts it to local time
change_time = datetime.datetime.fromtimestamp(properties.st_ctime)
print(f"The file's st_ctime is {properties.st_ctime:.1f} seconds after "
      f"Jan 1st, 1970. \nIn human readable format that translates to "
      f"{change_time.strftime('%b %d, %Y, %H:%M')}.\n")


<br>

<br>

If you are using a `Mac` or a `Linux` machine, then you can read all the groups that exist in the system by accessing the groups database:

In [None]:
for name in grp.getgrall():
    if 'amaral' in name.gr_mem:
        print(name)
        
grp.getgrnam('staff')

## Ownership and permissions

You can use so-called **shell** commands in a terminal to navigate the filesystem, to create files and folders, to delete, copy, and move files and folders and to run programs.  The things that you can do depend on your privileges which are related to the account and group ownership of the file or folder.

The privilege information is organized according to ownership level.  The three ownership levels are **user**, **group**, and **all**. User refers to the privileges of the account that owns the file. In the example above, all files are owned by the `amaral` account.

Users of a filesystem can be organized into groups.  A user account can belong to several groups, but a file or folder can only be assigned to a single group.  Groups enable different users -- for example, collaborators on a project -- to share greater privileges in accessing a file or folder.

All accounts -- including those that run services for the computer, such as communicating with the printer, or the mouse -- fall into the last ownership level: all.

The first of the 10 characters in the first column indicates whether the name at the end refers to a directory (`d`), a symbolic link (`l`), or a regular file (`-`).

The following 3 characters concern reading (`r`), writing (`w`), and execution (`x`) permissions for the owner of the file or folder.  Files storing data can typically be read and written. If you want to make sure a file is not overwritten, you can remove writing privileges from everyone including the file's owner.  

The following 3 characters concern reading (`r`), writing (`w`), and execution (`x`) permissions for all the users belonging to the relevant group.

The final 3 characters concern reading (`r`), writing (`w`), and execution (`x`) permissions for everyone else.
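
The 10-character code can be reproduced in Python with `stat.filemode` from the standard library, which converts a numeric mode value (such as the `st_mode` field returned by `os.stat`) into the familiar string. The two mode values below are illustrative and are not taken from the listing above.

```python
import stat

# A regular file: owner can read/write, group and everyone else can only read
file_code = stat.filemode(0o100644)
print(file_code)       # -rw-r--r--

# A directory: owner has full access, others can read and enter it
folder_code = stat.filemode(0o040755)
print(folder_code)     # drwxr-xr-x

# Split a code into its four pieces: type, owner, group, everyone else
kind = file_code[0]
owner = file_code[1:4]
group = file_code[4:7]
others = file_code[7:10]
print(kind, owner, group, others)   # - rw- r-- r--
```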

In [None]:
# Note that you have to use the octal representation in order to 
# properly read the permissions

print(f"The permissions for the file are {oct(os.stat('./roster_lib.py').st_mode)}.\n")


This is actually somewhat tricky to translate.  The last 3 digits above $-$ `644` $-$ are codes for the permissions. In order to understand their meaning, you must realize that `rwx` can be seen as bit values in which the presence of the letter means `1` and `-` means `0`.

For example, `r--` translates to the binary number `100`, which in the decimal system corresponds to 4 + 0 + 0 = 4.

`rwx` translates to the binary number `111`, which in the decimal system corresponds to 4 + 2 + 1 = 7.
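
The arithmetic above can be sketched as a small function. The name `symbolic_to_octal` is hypothetical; it simply treats each letter as a `1` bit and each `-` as a `0` bit, one group of three characters at a time.

```python
def symbolic_to_octal(symbolic):
    """Convert a 9-character permission string such as 'rw-r--r--' to '644'."""
    digits = []
    for start in range(0, 9, 3):
        triplet = symbolic[start:start + 3]
        # A letter means the bit is set (1); a dash means it is not (0)
        bits = ''.join('0' if character == '-' else '1' for character in triplet)
        # Read the three bits as a binary number, e.g. '110' -> 4 + 2 + 0 = 6
        digits.append(str(int(bits, 2)))
    return ''.join(digits)

print(symbolic_to_octal('rw-r--r--'))   # 644
print(symbolic_to_octal('rwxr-xr-x'))   # 755
```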

You can use `os.chmod()` to change the permissions of a file. **You can use the system above as long as you specify that the value you are passing to the function is in octal notation.**

In [None]:
help(os.chmod)

In [None]:
os.chmod('./roster_lib.py', 0o644)

In [None]:
ls -lht roster_lib.py

## Traveling the filesystem tree

As was mentioned earlier, there are two ways to report a path: 

> absolute 
>
> relative.

**Absolute paths** start from the *root* of the tree that we showed. On OS X or Windows that means the path will start with `/` or `C:\`, respectively. We just string together the folder names with the path separator to get to our current path.

**Note: This is written for OS X. If you are using Windows, change the `/` to `\`.**

In order to change the working folder we use the function `os.chdir`:

In [None]:
os.chdir('/')

In [None]:
os.getcwd()

In [None]:
os.listdir()

Or we can travel to the home directory

In [None]:
os.chdir('/Users/amaral')

In [None]:
os.getcwd()

In [None]:
os.listdir()

**Relative paths** start from where you **currently** are.  The symbol for your **current** directory is `.` The symbol for the **parent** directory (the folder above you) is `..`

Annoyingly, `...` is not a shortcut for the grandparent directory. To get there, chain two parent symbols: `../..`
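
How `.` and `..` resolve can be checked without touching the disk. The sketch below uses `posixpath`, the Unix flavor of `os.path` in the standard library, so it gives the same answers on any operating system. The starting folder `/Users/amaral/Documents` is a hypothetical location chosen for the illustration.

```python
import posixpath

start = '/Users/amaral/Documents'   # hypothetical starting folder

# normpath collapses the '.' and '..' pieces of a path
current = posixpath.normpath(posixpath.join(start, '.'))
parent = posixpath.normpath(posixpath.join(start, '..'))
grandparent = posixpath.normpath(posixpath.join(start, '../..'))

print(current)       # /Users/amaral/Documents
print(parent)        # /Users/amaral
print(grandparent)   # /Users
```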

In [None]:
os.listdir('.')

In [None]:
folder_contents = os.listdir('..')
folder_contents

**Notice that `folder_contents` is a `list` of `strings`.**

In [None]:
print(type(folder_contents))
print()
print(type(folder_contents[1]))

In [None]:
os.listdir('../..')

<br>


<br>

**MAKE SURE YOU RUN THE NEXT CELL!!!**

Otherwise, you will be working in the wrong directory.


In [None]:
os.chdir(our_working_directory)


You should now be back to the `Module_An_Intro_to_Python` directory.

In [None]:
os.getcwd()

In [None]:
os.listdir()
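
One way to avoid forgetting to change back is to wrap the visit to another folder in a context manager. The helper below, `working_directory`, is a hypothetical name and only a sketch of the idea; Python 3.11 and later also ship a ready-made `contextlib.chdir`.

```python
import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then return to where we started."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)   # runs even if the code inside raises an error

start = os.getcwd()
with working_directory('..'):
    inside = os.getcwd()     # the parent of the starting folder
after = os.getcwd()

print(after == start)        # True
```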

# `pathlib.Path` -- Constructing OS independent path objects

For details on this library read the [package documentation](https://docs.python.org/3/library/pathlib.html).

For why `pathlib` is great, read [this](https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/).

In [None]:
from pathlib import Path


In [None]:
help( Path.cwd )

In [None]:
print(Path.cwd())

But what is `Path.cwd()`?

In [None]:
print(type( Path.cwd() ))

In [None]:
help( Path.glob )

## `glob` - Unix-style pathname pattern expansion

For details on this library read the [package documentation](https://docs.python.org/3/library/glob.html).


In [None]:
current_folder = Path.cwd()
print(current_folder)
print()


In [None]:
pattern = '*e*'
print( current_folder.glob(pattern) )
print()


In [None]:
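# Print each match relative to the current folder, so the output does not
# depend on how long the absolute path is on your machine
for file in current_folder.glob(pattern):
    print(f".../{file.relative_to(current_folder)}")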

In [None]:
parent_folder = current_folder.parent
print(parent_folder)
print()


grandparent_folder = parent_folder.parent
print(grandparent_folder)
print()

The cool thing about `pathlib` is that it enables you to build paths simply by joining folder and file names in the correct sequence with the `/` operator.

In [None]:
data_folder = current_folder / 'Data'
print(data_folder)
print()

for file in data_folder.glob('*'):
    print(f".../{file.relative_to(current_folder)}")

This package also allows you access to glob in an easy manner:  `.rglob` goes recursively, whereas `.glob` does not.
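
The difference between the two can be seen on a tiny folder tree built just for the purpose. The sketch below creates the tree in a temporary folder (all file and folder names are made up), so it does not depend on the course files.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a tiny tree: two files at the top, one inside a subfolder
    (root / 'a.txt').write_text('top level')
    (root / 'b.csv').write_text('top level')
    (root / 'sub').mkdir()
    (root / 'sub' / 'c.txt').write_text('one level down')

    # .glob only looks in the folder itself
    shallow = sorted(p.relative_to(root).as_posix() for p in root.glob('*.txt'))
    # .rglob also descends into every subfolder
    deep = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.txt'))

print(shallow)   # ['a.txt']
print(deep)      # ['a.txt', 'sub/c.txt']
```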

In [None]:
for file in data_folder.rglob('*'):
    print(f".../{file.relative_to(current_folder)}")

<br>

<br>



`Path` enables us to get the parts of a path.

In [None]:
current_folder.parts

## `os` paths versus `pathlib` paths

In [None]:
# pathlib gives Path objects (PosixPath on Mac/Linux, WindowsPath on Windows)
#
print(type(parent_folder))
print()


# os paths are just strings
#
current_folder = os.getcwd()
print(current_folder)

print()
print(type(current_folder))


We can pass both path strings and `Path` objects to `os` functions

In [None]:
# Folder contents

print( os.listdir(parent_folder) )
print()
print( os.listdir(current_folder) )

# Reading files

Inside the `Data/` folder we have another folder labelled `Roster/`. 

The `Roster` folder is full of small `.txt` files (just raw ASCII text). Each file looks something like this:

---
```
#This is a file that holds important personal information that should not be shared. You are being watched.




Name:	Agatha A. Bailey
Date of Birth:	1/10/75
Email Address:	agatha.bailey@northwestern.edu
Department:	Engineering
Height:	6ft,0in
Weight:	220lbs
Favorite Color:	Lime
Favorite Animal:	Turtle
Zodiac Sign:	January
```   
---

You just got hired as an IT specialist at Northwestern University. Congratulations!

It is now your responsibility to deal with the security of all these files containing personally identifiable information (PII).

Your boss asked you to do an analysis of the demographics of the university staff.  You know, age, gender, favorite color.


<br>

<br>

<br>


When we work with **any** new data the first step is to **look** at it. Print parts of it. Make sure that you're familiar with all the data types before thinking about doing any real calculations with it.

So, let's start with an example file.

In [None]:
# Where our data sits
roster_path = Path.cwd() / 'Data' / 'Roster'
print(roster_path)
print()

# Pick one file
agatha_path = roster_path / 'Agatha_Bailey_798.txt'
print(agatha_path)
print()


In [None]:
# Open the file and print the file object itself (not its contents)
with open(agatha_path, 'r') as my_file:
    print(my_file)
    print()

In [None]:
# Read the file and print its content
with open(agatha_path, 'r') as my_file:
    agatha_data = my_file.read()


print(type(agatha_data))
print('------')
print(agatha_data)
print('------')

In [None]:
# Read file into a list of strings, one for each line
with open(agatha_path, 'r') as my_file:
    agatha_data = my_file.readlines()
    
print(type(agatha_data))
print('------')
print(len(agatha_data))
print('------')
print(agatha_data)
print('------')

In [None]:
for i, line in enumerate(agatha_data):
    print(f"{i:>2} >>{line}<<")

In [None]:
print( type(agatha_data) )  
print()
print( len(agatha_data) )


In [None]:
print( type(agatha_data[0]) )
print()
print( len(agatha_data[0]) )
print()
print( agatha_data[0] )
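
Once the lines are in a list, they can be turned into something easier to work with. The sketch below is one possible approach, assuming every data line has the form `field: value` as in the sample file shown earlier; the function name `parse_record` is hypothetical, and the sample lines are a shortened copy of that file.

```python
# A shortened copy of the sample file, as readlines() would return it
sample_lines = [
    "#This is a comment line\n",
    "\n",
    "Name:\tAgatha A. Bailey\n",
    "Department:\tEngineering\n",
    "Favorite Color:\tLime\n",
]

def parse_record(lines):
    """Turn 'field: value' lines into a dictionary, skipping comments and blanks."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Split at the first colon only, so values may contain colons too
        field, _, value = line.partition(':')
        record[field.strip()] = value.strip()
    return record

record = parse_record(sample_lines)
print(record['Name'])         # Agatha A. Bailey
print(record['Department'])   # Engineering
```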

## Reading many many files

The power of computers is that, unlike humans, they can easily 'read' vast numbers of files.  How do we instruct the computer to do it, though?

That is where the `.glob` method comes in!

In [None]:
my_files = roster_path.glob('*.txt')
roster_files = list( my_files )

print(f"my_files is a {type(my_files)}")
print()
print(f"roster_files is a {type(roster_files)}")
print('------')
print(f"There are {len(roster_files)} roster_files")


In [None]:
print(roster_files[:5])
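
Finding the files is only half of the job: each one still has to be opened and read. The sketch below shows the full loop on a temporary folder, so it does not depend on the course data; the file names and contents are made up for the illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two made-up roster files and one file that should be ignored
    (folder / 'Ada_Lovelace_001.txt').write_text('Name:\tAda Lovelace\n')
    (folder / 'Alan_Turing_002.txt').write_text('Name:\tAlan Turing\n')
    (folder / 'notes.md').write_text('not a roster file\n')

    # Read every .txt file and keep its text, keyed by file name
    contents = {}
    for path in sorted(folder.glob('*.txt')):
        with open(path, 'r') as handle:
            contents[path.name] = handle.read()

print(len(contents))       # 2
print(sorted(contents))    # ['Ada_Lovelace_001.txt', 'Alan_Turing_002.txt']
```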

# Writing files

If you perform some calculation, there are a number of reasons why you should store these values somewhere. 

There are three primary ways to store data: raw text, comma separated values (`csv`), and `json`.

## The dumb way

You can simply dump your data as a string into a file, just as if you were printing to the screen.

**But you should not do this.**

**Why would you force yourself to redo all that processing of files that contain unstructured text?** 



In [None]:
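# current_folder was reassigned to a plain string above, so build the
# path from a Path object instead
file_path = Path.cwd() / 'roster.txt'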

with open(file_path, 'w') as file_out:
    file_out.write( "".join(agatha_data) )
    

In [None]:
os.listdir()

In [None]:
!cat roster.txt

## The OK way

You can save your data as a table as if it was a spreadsheet.  This format is called `CSV` (comma separated values) and you can store data that has a list of lists structure.  

This is a bit better than unstructured text.  **If you use `pandas` you can even recover the information concerning data types.**


In [None]:
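# current_folder is a plain string at this point, so start from Path.cwd()
file_path = Path.cwd() / 'roster.csv'

with open(file_path, 'w') as file_out:
    # Strip each line's trailing newline so the values form one comma-separated row.
    # Note: a value that itself contains a comma (e.g. 6ft,0in) will split in two.
    file_out.write( ",".join(line.strip() for line in agatha_data) + "\n" )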
    

In [None]:
os.listdir()

In [None]:
!cat roster.csv
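
For data that really has a list of lists structure, the standard-library `csv` module takes care of the commas and the quoting. The sketch below is a minimal round trip through an in-memory text buffer rather than a file on disk; the table reuses three values from the sample file, and the column names are made up.

```python
import csv
import io

# A list of lists: one header row and one data row
table = [
    ['name', 'height', 'weight_lbs'],
    ['Agatha A. Bailey', '6ft,0in', 220],
]

# Write the table as CSV text; values containing commas get quoted
buffer = io.StringIO()
csv.writer(buffer).writerows(table)

# Read it back: the structure survives, but every value is now a string
buffer.seek(0)
loaded = list(csv.reader(buffer))

print(loaded[1])   # ['Agatha A. Bailey', '6ft,0in', '220']
```

Note that the number `220` comes back as the string `'220'`: plain CSV does not store data types.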

## A better way

You should save your data using the [`JSON` format](http://json.org/). With `JSON`, we can store Python lists and dictionaries using structured text files.

When we read or write `JSON` files, we go directly between the text in the file and a Python data object such as a list or a dictionary.

Next, we will use `JSON` to store and retrieve structured data.

First, we have to import the package.

In [None]:
import json

In [None]:
# Use the items in roster_files to create a list of 
# tuples with the first and last names of everyone,
# then save the list to 'roster_names.json' with json.dump.

roster_names = []
for filename in roster_files[:10]:  # to start, limit the number of files you load
    pass  # Your code here

print(roster_names)

In [None]:
os.listdir()

In [None]:
!cat roster_names.json

You can now load the data from the `json` file...

In [None]:
with open(Path.cwd() / 'roster_names.json', 'r') as json_file:
    loaded_names = json.load(json_file)

print(loaded_names)
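
A complete write-then-read round trip can be sketched as follows. The names in the list are made up, and the file is written to a temporary folder so that nothing in the course directory is touched.

```python
import json
import tempfile
from pathlib import Path

# A list of (first name, last name) tuples
names = [('Agatha', 'Bailey'), ('Ada', 'Lovelace')]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / 'roster_names.json'

    # Write the Python object to disk as JSON text
    with open(json_path, 'w') as json_file:
        json.dump(names, json_file, indent=2)

    # Read the JSON text back into a Python object
    with open(json_path, 'r') as json_file:
        loaded = json.load(json_file)

print(loaded)   # [['Agatha', 'Bailey'], ['Ada', 'Lovelace']]
```

`JSON` has no tuple type, so the tuples come back as lists; the strings and the nesting are preserved.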

# Exercises

Use a `for` loop to create a list of lists with the names (first and last) of five of your friends.

Write your list to files using each of the three methods described above.

Use the `os` package to make sure the files exist and to get their permissions in octal representation.

Change the permissions of the `.txt` file so that no user (including you) can write or execute it and that you are the only user that can read it.