# Reading and writing files

Typing strings and numbers into an IPython Notebook is a great way to learn the basics,
but sooner or later you will have to read data from a file, perform some analysis on that data, and ideally save the results.

But first we need to go over the basics of the filesystem so you know where and how things are located.

# Filesystems

A filesystem is presented to you in a folder structure like this:

<img src='../images/osx_finder.png'></img>

Each folder sits inside another folder, and this nesting continues all the way up until we reach the root of the hard drive.

One way to visualize this differently is to think of the file system as a tree. In Linux/OS X the tree generically looks like this:

<img src='../images/linux_fs_tree.png' width='600px'></img>

On Windows it's very similar; the root is just `C:\` instead of `/`.

<img src='../images/windows_fs_tree.png' width='400px'></img>

When we want to see what is inside a folder we can use the `ls` command (it stands for `list`). `ls` is **not** a Python command (it comes from the terminal) but IPython notebook allows us to use this command natively in any code cell **as long as there is no Python code in the cell**.

If we wanted to see all of the files in this directory we can just use the following command:

In [5]:
ls *

Collection-Types.ipynb               Data-Visualization.ipynb             Review.ipynb
Data-Types.ipynb                     File-IO.ipynb                        Standard-Library.ipynb
Data-Types.pdf                       Flow-Control.ipynb                   Visualizations.ipynb
Data-Types.tex                       IPython-Notebook-Introduction.ipynb  visualization.ppt


`ls` means `list` and the `*` is what we call a `wildcard`. The `*` wildcard matches everything.

We can combine it with some text to restrict what we display, though. If we only want to see IPython notebooks we can do:

In [7]:
ls *ipynb

Collection-Types.ipynb               File-IO.ipynb                        Review.ipynb
Data-Types.ipynb                     Flow-Control.ipynb                   Standard-Library.ipynb
Data-Visualization.ipynb             IPython-Notebook-Introduction.ipynb  Visualizations.ipynb
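
The same wildcard matching is also available from Python itself. Here is a minimal sketch using the standard-library `pathlib` module. The file names are made up for illustration and are created in a temporary folder so that the example runs anywhere:

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with a few hypothetical files in it
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ['File-IO.ipynb', 'Review.ipynb', 'Data-Types.pdf']:
        (folder / name).touch()

    # Similar to `ls *`: every entry in the folder
    everything = sorted(p.name for p in folder.glob('*'))
    # Similar to `ls *ipynb`: only the names that end in "ipynb"
    notebooks = sorted(p.name for p in folder.glob('*ipynb'))

print(everything)
print(notebooks)
```

Unlike `ls`, this is ordinary Python, so it can share a cell with other Python code.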


## But now how do we move out of our current folder?

There are two ways to access a path: (i) absolute and (ii) relative.

**Absolute paths** start from the *root* of the tree that we showed. On OS X or Windows that means the path will start with `/` or `C:\`, respectively. We just string together the folder names with the path separator to get to our current path.

**Note:** I have written this for OS X. If you are using Windows, change the `/` to `\`.

We don't always have to start from the root, though. The `~` symbol stands for our user directory, and we can start paths from there.

In [8]:
ls /

Applications/              Users/                     dev/                       net/                       tmp@
Library/                   Volumes/                   etc@                       opt/                       usr/
Network/                   bin/                       home/                      private/                   var@
System/                    cores/                     installer.failurerequests  sbin/


In [9]:
ls ~/

Amaral/               Creative Cloud Files/ Dropbox/              Movies/               Public/
Applications/         Desktop/              Google Drive/         Music/                Sites/
Box Sync/             Documents/            Kellogg/              Pictures/             scikit_learn_data/
Coding/               Downloads/            Library/              Projects/


In [10]:
ls ~/Desktop

$RECYCLE.BIN/    Thumbs.db        [C] Windows 8.1@


**Relative paths** start from where you **currently** are. 

The symbol for your **current** directory is `.`

The symbol for the **parent** directory (the folder above you) is `..`
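
The same symbols work inside Python. The sketch below uses the standard-library `pathlib` and `os.path` modules to show how `.`, `..` and `~` relate to absolute paths. The variable names are only illustrative, and the printed folders depend on where you run it:

```python
import os.path
from pathlib import Path

here = Path('.')     # the current directory (a relative path)
parent = Path('..')  # the folder above it (also relative)

# expanduser() replaces `~` with the user directory when it can be determined
home = Path(os.path.expanduser('~'))

# resolve() turns a relative path into an absolute one
absolute_here = here.resolve()
absolute_parent = parent.resolve()

print(here.is_absolute(), absolute_here.is_absolute())
print(absolute_here)
print(absolute_parent)
print(home)
```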

In [11]:
ls .

Collection-Types.ipynb               Data-Visualization.ipynb             Review.ipynb
Data-Types.ipynb                     File-IO.ipynb                        Standard-Library.ipynb
Data-Types.pdf                       Flow-Control.ipynb                   Visualizations.ipynb
Data-Types.tex                       IPython-Notebook-Introduction.ipynb  visualization.ppt


In [12]:
ls ../

Advanced-Programming/    Handouts/                Projects/                README.md                images/
Basic-Basics/            Homeworks/               Prologue/                Special-Topics/          styles/
Data/                    LICENSE.md               Python-Basics/           Statistical-Programming/


Here is the structure of the `Bootcamp/` folder that we just downloaded.

<img src='../images/bootcamp_structure.svg'></img>

For this session we're going to use files that are in the `Data/` folder and right now we are using the `File-IO.ipynb` notebook. 

**Exercise:** Show me what is in the `Data/` folder.

# Now for reading files

Inside the `Data/` folder we have another folder labelled `Roster/`. The `Roster/` folder is full of small `.txt` files (just raw text). Each file looks something like this:

    #This is a file that holds important personal information that should not be shared. You are being watched.

    Name:	Agatha A. Bailey
    Date of Birth:	1/10/75
    Email Address:	agatha.bailey@northwestern.edu
    Department:	Engineering
    Height:	6ft,0in
    Weight:	220lbs
    Favorite Color:	Lime
    Favorite Animal:	Turtle
    Zodiac Sign:	January
    
You have all of these files because you just got a new job as an administrator in a department at Northwestern University. Congratulations!

Since you're the new administrator you want to calculate some basic properties of the student body population.

When we work with **any** new data the first step is to **look** at it. Print parts of it. Make sure that you're familiar with all the data types before thinking about doing any real calculations with it.

Now, how are we going to process this information, especially since everything will come in as a string?

# Thinking algorithmically

Human brains are great at reducing the complexity of problems so that the answers seem obvious. 
If I tell you my birthday and ask you to tell me how old I am, most of you can give me an answer in
almost no time, but making your thought process explicit can be difficult. 

To do any programming or data analysis, perhaps the most important thing that you need to learn is how to break down a problem (one that might seem really simple to do in your head) into tiny steps so that you can teach a computer how to do it.

Let's start with an exercise: how old am I?

In [2]:
### Here is my information
birth_month = 2
birth_day = 25
birth_year = 1984

current_month = 3
current_day = 24
current_year = 2015

In [4]:
##Place your code here to calculate the age

##How old am I?
# print(age)

So now we see how we have to break down all of these problems.
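
Here is one possible breakdown of the age problem, using the variables defined above. It is only a sketch and there are many other valid ways to do it. This one subtracts the years and then corrects the result if this year's birthday has not happened yet:

```python
birth_month, birth_day, birth_year = 2, 25, 1984
current_month, current_day, current_year = 3, 24, 2015

# Step 1: how many calendar years separate the two dates?
age = current_year - birth_year

# Step 2: has this year's birthday happened yet?
# Tuples compare element by element, so (month, day) pairs order like dates.
if (current_month, current_day) < (birth_month, birth_day):
    # Step 3: if not, the most recent year has not been completed
    age -= 1

print(age)
```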

Let's move onto actually reading a file.

In [4]:
myFile = open('../Data/Roster/Agatha_Bailey_798.txt', 'r')
myFile

<_io.TextIOWrapper name='../Data/Roster/Agatha_Bailey_798.txt' mode='r' encoding='UTF-8'>

In [8]:
myFile = open('../Data/Roster/Agatha_Bailey_798.txt', 'r')
Agatha = myFile.read()
Agatha

'#This is a file that holds important personal information that should not be shared. You are being watched.\n\n\n\nName:\tAgatha A. Bailey\nDate of Birth:\t1/10/75\nEmail Address:\tagatha.bailey@northwestern.edu\nDepartment:\tEngineering\nHeight:\t6ft,0in\nWeight:\t220lbs\nFavorite Color:\tLime\nFavorite Animal:\tTurtle\nZodiac Sign:\tJanuary\n'

In [9]:
myFile = open('../Data/Roster/Agatha_Bailey_798.txt', 'r')
Agatha = myFile.readlines()
print(Agatha)

['#This is a file that holds important personal information that should not be shared. You are being watched.\n', '\n', '\n', '\n', 'Name:\tAgatha A. Bailey\n', 'Date of Birth:\t1/10/75\n', 'Email Address:\tagatha.bailey@northwestern.edu\n', 'Department:\tEngineering\n', 'Height:\t6ft,0in\n', 'Weight:\t220lbs\n', 'Favorite Color:\tLime\n', 'Favorite Animal:\tTurtle\n', 'Zodiac Sign:\tJanuary\n']


In [11]:
print( type(Agatha) )
print( len(Agatha) )
print( type(Agatha[0]) )
print( len(Agatha[0]) )
print( Agatha[0] )

<class 'list'>
13
<class 'str'>
108
#This is a file that holds important personal information that should not be shared. You are being watched.



# That's it! 

# You read a file and it is now a data type we understand!
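
The cells above never close the file they open. Here is a sketch of the same `readlines()` pattern using a `with` block, which closes the file automatically. The two-line sample text and the file name `sample_roster.txt` are hypothetical stand-ins, written to a temporary folder so the example runs without the `Data/` folder:

```python
import os
import tempfile

# Hypothetical two-line stand-in for a roster file
sample_text = 'Name:\tAgatha A. Bailey\nDepartment:\tEngineering\n'

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample_roster.txt')
    with open(sample_path, 'w') as handle:
        handle.write(sample_text)

    # The file is closed automatically when the `with` block ends
    with open(sample_path, 'r') as handle:
        lines = handle.readlines()

print(lines)
print(handle.closed)
```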

But let's see if you can tell me how old Agatha is. First we'll need to go from a line in that file to the variables that we used above to calculate someone's age.

In [14]:
for line in Agatha:
    if 'Name' in line:
        print(line)
#         birthday_line = line
# print(birthday_line)

Name:	Agatha A. Bailey



# A refresher on manipulating strings

In [None]:
temporary_line = 'Adam,Hockenberry,02-25-1984\n'

In [None]:
print(temporary_line)

In [None]:
print(temporary_line.strip('\n'))

In [None]:
print(temporary_line.split(','))

In [None]:
print(temporary_line.strip('\n').split(','))

In [None]:
line_as_list = temporary_line.strip('\n').split(',')
print(line_as_list)

In [None]:
DOB = line_as_list[2]
print(DOB)

In [None]:
# DOB is written month-day-year, so the month comes first
birth_month = DOB.split('-')[0]
birth_day = DOB.split('-')[1]
birth_year = DOB.split('-')[2]
print(birth_month, birth_day, birth_year)
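
The pieces that come out of `split` are still strings, so they need to be converted before doing arithmetic with them. Here is a minimal sketch of the whole chain on the same example line, assuming the date is written month-day-year:

```python
temporary_line = 'Adam,Hockenberry,02-25-1984\n'

# Remove the newline, then cut the line at every comma
first_name, last_name, DOB = temporary_line.strip('\n').split(',')

# Cut the date at every dash and turn each piece into an integer
birth_month, birth_day, birth_year = (int(part) for part in DOB.split('-'))

print(birth_month, birth_day, birth_year)
print(type(birth_year))
```

Note that `int('02')` gives `2`, so the leading zero disappears once the value is a number.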

# Back to Agatha...

In [None]:
for line in Agatha:
    if 'Date of Birth' in line:
        print(line)
        birthday_line = line

**Exercise:** apply the string manipulations that you just learned to get Agatha's birthday as variables that we can use

In [None]:
# birth_day = 
# birth_month = 
# birth_year = 

# We have all the parts but they're pretty scattered right now, let's put it all in one place:

In [None]:
current_month = 3
current_day = 24
current_year = 2015

myFile = open('../Data/Roster/Agatha_Bailey_798.txt', 'r')
Agatha = myFile.readlines()
for line in Agatha:
    if 'Date of Birth' in line:
        birthday_line = line
###########################################################################
###Paste your code to get birth_month, birth_day, and birth_year here!


print(birth_month, birth_day, birth_year)

###########################################################################
#### Copy and paste the algorithm that you developed to calculate someone's age here!

        
# print(age)

# Writing files

If you perform some calculation, you will usually want to store the results somewhere, so that you can reuse them later without recomputing them and share them with other people or programs.

In [None]:
### Here is a dictionary of TAs' names and ages
names = ['Adam H', 'Peter W', 'Joao M', 'Hyojun L']
ages = [21, 31, 24, 19]
# for x in zip(names, ages):
#     print(x)
age_dictionary = dict(zip(names, ages))

In [None]:
delimiter = ','

# output_file = open('../Data/TA_ages.csv', 'w')

# for name, age in age_dictionary.items():
#     output_file.write(name + delimiter + str(age) + '\n')  # age is an int, so convert it with str()
# #     print(name + delimiter + str(age), file=output_file)  # print adds the newline itself
# #     output_file.write('{}{}{}\n'.format(name, delimiter, age))
# output_file.close()


####Here is another way that should do the exact same thing
# output_file = open('../Data/TA_ages.csv', 'w')  # reopen, since the file was closed above
# for name in age_dictionary.keys():
#     output_file.write(name + delimiter + str(age_dictionary[name]) + '\n')
# output_file.close()

####Here is yet another way that should also do the exact same thing
# with open('../Data/TA_ages.csv', 'w') as output_file:
#     for name, age in age_dictionary.items():
#         output_file.write(name + delimiter + str(age) + '\n')
# # no close() needed here: the with block closes the file for us
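
For comma-separated files, the standard-library `csv` module is another option. It is not used elsewhere in this notebook, so treat the following as an optional sketch. It writes the same names and ages to a temporary folder (to avoid touching the real `Data/` folder) and reads them back:

```python
import csv
import os
import tempfile

age_dictionary = {'Adam H': 21, 'Peter W': 31, 'Joao M': 24, 'Hyojun L': 19}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'TA_ages.csv')

    # newline='' lets the csv module control the line endings itself
    with open(csv_path, 'w', newline='') as output_file:
        writer = csv.writer(output_file, delimiter=',')
        for name, age in age_dictionary.items():
            writer.writerow([name, age])  # numbers are converted to text for us

    # Read the file back to check what was written
    with open(csv_path, 'r', newline='') as input_file:
        rows = list(csv.reader(input_file))

print(rows)
```

Everything comes back as strings, so the ages would need `int()` again before any calculation.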
    

# What if we forgot someone?

Take a deep breath. All is not yet lost.

When opening files, we have used 'r' and 'w' for read and write but there is one more that I haven't told you about: 'a' for append.

In [None]:
new_age_dictionary = {'Adam P': 38, 'Nick T': 28}
delimiter = ','

output_file = open('../Data/TA_ages.csv', 'a')
for name, age in new_age_dictionary.items():
    output_file.write(name + delimiter + str(age) + '\n')    
output_file.close()
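
To see the difference between the modes, here is a small sketch that writes, appends, and then writes again to a hypothetical file in a temporary folder. The file name and contents are made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ages.csv')

    # 'w' creates the file (or empties it if it already exists)
    with open(path, 'w') as output_file:
        output_file.write('Adam H,21\n')

    # 'a' keeps the existing contents and adds to the end
    with open(path, 'a') as output_file:
        output_file.write('Adam P,38\n')
    with open(path, 'r') as input_file:
        after_append = input_file.read()

    # Opening with 'w' again throws away everything that was there
    with open(path, 'w') as output_file:
        output_file.write('Nick T,28\n')
    with open(path, 'r') as input_file:
        after_rewrite = input_file.read()

print(repr(after_append))
print(repr(after_rewrite))
```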

# Final exercise
Let's read that same file again, but instead of calculating Agatha's age, I'd like to know her height in centimeters (it's for a collaboration with Europeans). When you're finished, please write her name and her height into a new file that looks like this (it will be pretty boring for now, with only two lines):

    Name, Height (cm)
    Agathas_full_name, Agathas_height
    


***If you get stuck, remember to break the problem down into small steps:***
    
    1) Read the file and find the lines that we care about:
        a) name_line
        b) height_line
    2) Strip those lines apart so that we have three variables:
        a) name
        b) height_component_feet
        c) height_component_inches
    3) Get her height in inches:
        a) total_height_inches
    4) Then convert it into centimeters
    5) Write everything into a new file
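
As a hint for steps 3 and 4 only, here is the unit arithmetic on made-up numbers (5 feet 7 inches, not Agatha's height). It assumes the feet and inches have already been converted to integers. One foot is 12 inches and one inch is 2.54 centimeters:

```python
INCHES_PER_FOOT = 12
CM_PER_INCH = 2.54

height_component_feet = 5    # hypothetical value
height_component_inches = 7  # hypothetical value

# Step 3: total height in inches
total_height_inches = height_component_feet * INCHES_PER_FOOT + height_component_inches

# Step 4: convert inches to centimeters
height_cm = total_height_inches * CM_PER_INCH

print(total_height_inches, round(height_cm, 1))
```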

In [None]:
####Load this file
input_file_name = '../Data/Roster/Agatha_Bailey_798.txt'

########################
###Place your code here:


###Write data into this file
output_file_name = '../Data/roster_heights.csv'
