# Files

## Opening Files

To read a file, you must first open it. Opening returns a file handle, which you can then use to get the contents of the file. If the file doesn't exist, `open` raises an error.

    file_handle = open('filename.txt')
    

Once you are done with a file, you need to close it. Leaving files open can cause problems, particularly on locking filesystems, where an open handle can keep other programs from using the file.

    file_handle.close()
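
One way to see what closing does is to check the handle's `closed` attribute before and after. This is a minimal sketch: the scratch file `example.txt` is a made-up name, and it is created in a temporary directory so the example does not depend on any existing data.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
setup = open(path, 'w')
setup.write('hello\n')
setup.close()

file_handle = open(path)
open_before = file_handle.closed    # False: the handle is still open
file_handle.close()
closed_after = file_handle.closed   # True: the handle has been closed
```

Reading from a handle after it has been closed raises a `ValueError`, which is usually a sign that `close()` was called too early.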


In [1]:
# Run these to get some of the files we will be using today. 
# These are the salaries of public workers in California from the website transparentcalifornia
# The last line is downloading a short story for the project

import urllib
urllib.urlretrieve("http://transparentcalifornia.com/export/san-francisco-2014.csv", "san-francisco-2014.csv")
urllib.urlretrieve("http://transparentcalifornia.com/export/san-francisco-2013.csv", "san-francisco-2013.csv")
urllib.urlretrieve("http://www.gutenberg.org/cache/epub/1952/pg1952.txt", "theyellowwallpaper.txt")

('theyellowwallpaper.txt', <httplib.HTTPMessage instance at 0x1058d13b0>)

In [2]:
# Opening a file
fh = open('san-francisco-2014.csv')
print fh
fh.close()


<open file 'san-francisco-2014.csv', mode 'r' at 0x104c67c00>


In [3]:
# Opening a nonexistent file
fh = open('i_dont_exist.txt')
print fh
fh.close()

IOError: [Errno 2] No such file or directory: 'i_dont_exist.txt'

### TRY IT
Open and close the san-francisco-2013.csv file

In [4]:
fh = open('san-francisco-2013.csv')
fh.close()

## Text files and lines

A text file is just a sequence of lines. In fact, if you read it in all at once with `readlines()`, you get back a list of strings, one per line.

Lines are separated by the newline character "\n". This special character is inserted into text files when you hit Enter, and you can deliberately put it into strings by using the \n syntax.

In [5]:
print "Golden\nGate\nBridge"

Golden
Gate
Bridge
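
To make the point that "\n" is an ordinary single character, here is a small sketch that measures and splits the same string. The variable names are only for illustration.

```python
text = "Golden\nGate\nBridge"

# The newline counts as one character, even though it is typed as \n
length = len(text)          # 6 + 1 + 4 + 1 + 6 = 18

# Splitting on the newline recovers the individual lines
parts = text.split("\n")    # ['Golden', 'Gate', 'Bridge']

# repr shows the escape sequence instead of acting on it
shown = repr(text)
```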


### TRY IT
Print your name on two lines using only one print statement 

In [6]:
print "Charlotte\nWeaver"

Charlotte
Weaver


## Reading from files

There are two common ways to read through a file. The first (and usually better) way is to loop through the lines in the file.

    for line in file_handle:
        print line

The second is to read all the lines at once and store them as a string or a list.

    lines = file_handle.read() # stores as a single string
    lines = file_handle.readlines() # stores as a list of strings (separates on new lines)

Unless you are going to process the lines in a file several times, use the first method. It reads the file a line at a time instead of loading everything into memory, which matters when you have big files.

In [8]:
fh = open('thingstodo.txt')
for line in fh:
    print line.rstrip()
fh.close()


Alcatraz
Golden Gate Bridge
Golden Gate Park
The Exploratorium
Pier 39


In [10]:
fh = open('thingstodo.txt')
contents = fh.read()
fh.close()

print contents
print type(contents)

fh = open('thingstodo.txt')
lines = fh.readlines()
fh.close()

print lines
print type(lines)

Alcatraz
Golden Gate Bridge
Golden Gate Park
The Exploratorium
Pier 39

<type 'str'>
['Alcatraz\n', 'Golden Gate Bridge\n', 'Golden Gate Park\n', 'The Exploratorium\n', 'Pier 39\n']
<type 'list'>
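
The three approaches can be compared side by side on one small file. This is a sketch under assumptions: `things.txt` is a hypothetical stand-in for thingstodo.txt, written to a temporary directory so the example runs on its own.

```python
import os
import tempfile

# Hypothetical scratch file standing in for thingstodo.txt
path = os.path.join(tempfile.mkdtemp(), 'things.txt')
out = open(path, 'w')
out.write('Alcatraz\nGolden Gate Bridge\nPier 39\n')
out.close()

# Method 1: loop over the handle, one line at a time
fh = open(path)
looped = []
for line in fh:
    looped.append(line.rstrip('\n'))  # drop the trailing newline
fh.close()

# Method 2a: the whole file as one string
fh = open(path)
contents = fh.read()
fh.close()

# Method 2b: a list of lines, each still ending in '\n'
fh = open(path)
lines = fh.readlines()
fh.close()

# Joining the list back together gives the same text as read()
same_text = ''.join(lines) == contents
```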


### TRY IT
Open 'san-francisco-2013.csv' and print out the first line. You can use either method. If you are using the loop method, you can 'break' after printing the first line.

In [16]:
fh = open('san-francisco-2013.csv')
# for line in fh:
#     print line
#     break
first_line = fh.readline()
print first_line
fh.close()

employee_name,job_title,base_pay,overtime_pay,other_pay,total_benefits,total_pay,total_pay_benefits,year,notes,jurisdiction_name



## Searching through a file

When searching through a file, you can use string methods to discover and parse the contents.

Let's look at a few examples.

In [17]:
# Looking for a line that starts with something

# I want to see salary data of women with my first name
fh = open('san-francisco-2014.csv')
for line in fh:
    if line.startswith('Charlotte'):
        print line
fh.close()

Charlotte L Jaques,"AprntcStatnry Eng,WtrTreatPlnt",86679.08,8924.69,5713.36,34054.46,101317.13,135371.59,2014,,San Francisco,FT

Charlotte R Kuo,Nurse Practitioner,83539.45,0.00,250.00,26455.08,83789.45,110244.53,2014,,San Francisco,PT

Charlotte C Wu,Senior Management Assistant,74163.11,0.00,0.00,30828.83,74163.11,104991.94,2014,,San Francisco,FT

Charlotte Coloyan Dela Cruz,Legal Secretary 1,66492.03,0.00,0.00,29062.63,66492.03,95554.66,2014,,San Francisco,FT

Charlotte E Grimes-Brown,Health Worker 2,59388.65,3880.56,2482.62,27328.09,65751.83,93079.92,2014,,San Francisco,FT

Charlotte B Coquia,Personnel Trainee,56456.54,0.00,1061.50,27700.77,57518.04,85218.81,2014,,San Francisco,FT

Charlotte R Sanders,Librarian 1,49527.35,0.00,943.69,19430.78,50471.04,69901.82,2014,,San Francisco,PT

Charlotte L Leung,HSA Sr Eligibility Worker,21401.54,0.00,14556.04,8669.72,35957.58,44627.30,2014,,San Francisco,PT

Charlotte L Vance,Public Svc Aide-Public Works,5392.91,989.32,53.28,4078.86,6435.51,

In [18]:
# Looking for lines that contain a specific string
fh = open('san-francisco-2014.csv')
# Looking for all the department heads
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Dept Head') != -1:
        print line
fh.close()

John L Martin,Dept Head V,311298.55,0.00,0.00,89772.32,311298.55,401070.87,2014,,San Francisco,FT

Barbara A Garcia,Dept Head V,279839.22,0.00,2164.54,82884.27,282003.76,364888.03,2014,,San Francisco,FT

Naomi M Kelly,Dept Head V,267914.01,0.00,0.00,80361.22,267914.01,348275.23,2014,,San Francisco,FT

Trent E Rhorer,Dept Head V,267914.00,0.00,0.00,79799.88,267914.00,347713.88,2014,,San Francisco,FT

Jay P Huish,Dept Head V,267914.00,0.00,0.00,79799.88,267914.00,347713.88,2014,,San Francisco,FT

John S Rahaim,Dept Head IV,232489.33,0.00,0.00,72233.72,232489.33,304723.05,2014,,San Francisco,FT

Luis Herrera,Dept Head IV,226832.02,0.00,0.00,71586.74,226832.02,298418.76,2014,,San Francisco,FT

Mohammed C Nuru,Dept Head IV,219184.68,0.00,0.00,70095.24,219184.68,289279.92,2014,,San Francisco,FT

Anne M Kronenberg,Dept Head IV,219212.28,0.00,0.00,69397.69,219212.28,288609.97,2014,,San Francisco,FT

Tom C Hui,Dept Head III,202290.14,0.00,14958.35,68955.03,217248.49,286203.52,2014,,San Francisc

In [19]:
# Counting lines that match criteria
fh = open('san-francisco-2014.csv')
num_trainees = 0
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Trainee') != -1:
        num_trainees += 1
fh.close()
print "There are {0} trainees".format(num_trainees)

There are 518 trainees
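
The same test can be written with the `in` keyword from the strings lesson, which some readers find easier to read than comparing `find` against -1. The sketch below uses three rows shortened from the output shown earlier in this lesson, held in a list so it runs without the CSV file.

```python
# Rows shortened from the output earlier in this lesson (first three columns only)
rows = [
    'Charlotte B Coquia,Personnel Trainee,56456.54',
    'Charlotte R Sanders,Librarian 1,49527.35',
    'John L Martin,Dept Head V,311298.55',
]

num_trainees = 0
for line in rows:
    # 'Trainee' in line is True exactly when line.find('Trainee') != -1
    if 'Trainee' in line:
        num_trainees += 1

# Check that both tests agree on every row
agree = all(('Trainee' in r) == (r.find('Trainee') != -1) for r in rows)
```

With a real file, the loop would run over the file handle instead of the `rows` list.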


In [21]:
# Splitting lines; this is great for Excel-like data (tsv, csv)
# I want to see salary data of women with my name
fh = open('san-francisco-2014.csv')
for line in fh:
    if line.startswith('Charlotte'):
        cols = line.split(',')
        # Job title is the 2nd column and base pay (salary) is the 3rd
        print cols[1], cols[2]
fh.close()

"AprntcStatnry Eng WtrTreatPlnt"
Nurse Practitioner 83539.45
Senior Management Assistant 74163.11
Legal Secretary 1 66492.03
Health Worker 2 59388.65
Personnel Trainee 56456.54
Librarian 1 49527.35
HSA Sr Eligibility Worker 21401.54
Public Svc Aide-Public Works 5392.91
Public Service Trainee 2625.93
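
Notice that the first row of output is wrong: that job title contains a comma inside quotes, so `split(',')` cuts it in two and the salary never appears. One possible fix is the standard library's `csv` module, which understands quoted fields. The sketch below uses one row from the earlier output, shortened to three columns.

```python
import csv

# One row copied from the output earlier in this lesson, shortened to three columns
line = 'Charlotte L Jaques,"AprntcStatnry Eng,WtrTreatPlnt",86679.08'

# Plain split breaks the quoted job title at its inner comma
naive = line.split(',')
# naive[1] is '"AprntcStatnry Eng' and naive[2] is 'WtrTreatPlnt"'

# csv.reader accepts any iterable of lines, including an open file handle
parsed = next(csv.reader([line]))
# parsed[1] is 'AprntcStatnry Eng,WtrTreatPlnt' and parsed[2] is '86679.08'
```

For a whole file, passing the file handle to `csv.reader` and looping over the result gives one list of columns per row.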


In [22]:
# Skipping lines
fh = open('thingstodo.txt')
for line in fh:
    if line.startswith('Golden'):
        continue
    print line
fh.close()

Alcatraz

The Exploratorium

Pier 39
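
A common variation is skipping a header row. One way, sketched here, is to call `next` on the file handle before the loop starts. The file name and its placeholder rows are made up for the illustration.

```python
import os
import tempfile

# Hypothetical CSV with a header row and a blank line; the data is placeholder text
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
out = open(path, 'w')
out.write('employee_name,job_title\nPerson A,Librarian 1\n\nPerson B,Health Worker 2\n')
out.close()

fh = open(path)
header = next(fh)            # consume the first line before the loop starts
data_rows = []
for line in fh:
    if line.strip() == '':
        continue             # skip blank lines
    data_rows.append(line.rstrip('\n'))
fh.close()
```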



## Try, except with open

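If you are worried that the file might not exist, you can wrap the open in a try block. Catch `IOError` specifically rather than using a bare `except:`, which would also hide unrelated errors.

    try:
        fh = open('i_dont_exist.txt')
    except IOError:
        print "File does not exist"
        exit()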
        
  
    

In [23]:
# Opening a nonexistent file
try:
    fh = open('i_dont_exist.txt')
    print fh
    fh.close()
except IOError:
    print "File does not exist"
    #exit()


File does not exist
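
If the program should carry on rather than exit, the try block can be wrapped in a small function that returns a fallback value. The helper name `read_lines_or_empty` is hypothetical, and this is only one way to structure it.

```python
import os
import tempfile

def read_lines_or_empty(path):
    """Hypothetical helper: return the file's lines, or [] if it cannot be opened."""
    try:
        fh = open(path)
    except IOError:
        # On Python 3, IOError is an alias of OSError, so a missing file is caught here too
        return []
    lines = fh.readlines()
    fh.close()
    return lines

# A path inside a fresh temporary directory, so the file is known not to exist
missing_path = os.path.join(tempfile.mkdtemp(), 'i_dont_exist.txt')
missing = read_lines_or_empty(missing_path)   # []
```

Only the `open` call sits inside the `try`, so an error raised while reading is not mistaken for a missing file.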


## Writing to files

You can write to files very easily. You need to give open a second parameter 'w' to indicate you want to open the file in write mode.
  
     fh_write = open('new_file.txt', 'w')

Then you call the write method on the file handle. You give it the string you want to write to the file. Be careful, `write` doesn't add a new line character to the end of strings like `print` does.

     fh_write.write('line to write\n')
     
Just like reading files, you need to close your file when you are done.

     fh_write.close()

In [24]:
fh = open('numbers.txt', 'w')
for i in range(10):
    fh.write(str(i) + '\n')
fh.close()
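
One thing to watch for: opening an existing file with 'w' erases its previous contents. The sketch below shows this, and contrasts it with the 'a' (append) mode, which this lesson does not otherwise cover. The file is a scratch file in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'numbers.txt')

fh = open(path, 'w')
fh.write('first\n')
fh.close()

# Opening with 'w' again starts from an empty file, so 'first' is lost
fh = open(path, 'w')
fh.write('second\n')
fh.close()

# Opening with 'a' keeps the existing contents and adds to the end
fh = open(path, 'a')
fh.write('third\n')
fh.close()

fh = open(path)
contents = fh.read()   # 'second\nthird\n'
fh.close()
```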

### TRY IT
Create a file called 'my_favorite_cities.txt' and put your top 3 favorite cities each on its own line.

**Bonus:** Check that you did it correctly by reading the lines back in Python.

In [25]:
fh = open('my_favorite_cities.txt', 'w')
fh.write('Atlanta\nBangkok\nSan Francisco?\n')
fh.close()

fh = open('my_favorite_cities.txt')
for line in fh:
    print line
fh.close()

Atlanta

Bangkok

San Francisco?



## With statement and opening files

You can use `with` to open a file, and it will automatically close the file at the end of the `with` block. This is the preferred way to open files in Python. (Sorry it took me so long to show you.)

    with open('filename.txt') as file_handle:
        for line in file_handle:
            print line
            
    # You don't have to close the file

In [26]:
with open('thingstodo.txt') as fh:
    for line in fh:
        print line.rstrip()

Alcatraz
Golden Gate Bridge
Golden Gate Park
The Exploratorium
Pier 39


You can also use with statements when writing files.

In [27]:
with open('numbers2.txt', 'w') as fh:
    for i in range(5):
        fh.write(str(i) + '\n')
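
The file is closed even if something goes wrong inside the block, which is the main reason `with` is preferred. This sketch checks the handle's `closed` attribute after a normal block and after a block that raises a deliberately simulated error. The scratch file name is made up.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'numbers2.txt')

with open(path, 'w') as fh:
    fh.write('0\n')
closed_after_block = fh.closed      # True: closed when the block ended

try:
    with open(path) as fh2:
        # Simulated failure partway through reading
        raise ValueError('simulated failure while reading')
except ValueError:
    pass
closed_after_error = fh2.closed     # True: closed despite the exception
```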

### TRY IT

Refactor this code to use a with statement:

    # Counting lines that match criteria
    fh = open('san-francisco-2014.csv')
    num_trainees = 0
    for line in fh:
        # Remember if find doesn't find the string, it returns -1
        if line.find('Trainee') != -1:
            num_trainees += 1
    fh.close()
    print "There are {0} trainees".format(num_trainees)


In [28]:
# Counting lines that match criteria
with open('san-francisco-2014.csv') as fh:
    num_trainees = 0
    for line in fh:
        # Remember if find doesn't find the string, it returns -1
        if line.find('Trainee') != -1:
            num_trainees += 1
print "There are {0} trainees".format(num_trainees)

There are 518 trainees


# Project 

We will calculate the average length of the first word in sentences in the short story "The Yellow Wallpaper" by Charlotte Perkins Gilman. (Feel free to use a different story; Project Gutenberg has many free ones: https://www.gutenberg.org/)

1. Open the file in read mode using a with statement.
2. Initialize two variables, sum and count, to the value of 0.
3. Loop through each line. If the first character of the line is a capital letter (check the strings lesson for the `in` keyword):
    * Add 1 to count.
    * Split the line on spaces and find the length of the first word. Add this length to sum.
4. Calculate the average length of the first words of sentences using the sum and count variables (be careful about integer division).
5. Open a new file 'ave_first_word_length.txt' in write mode using a with statement.
6. Write the title of the story on the first line and the average first word length on the second line.
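
The steps above can be sketched on a tiny made-up text before running them on the real story. Everything here is an assumption for illustration: the three sample lines, the title 'Sample Story', and the use of `total` in place of `sum` so the built-in `sum` function is not shadowed.

```python
import os
import string
import tempfile

# Hypothetical three-line "story" written to a temporary directory
workdir = tempfile.mkdtemp()
story_path = os.path.join(workdir, 'sample_story.txt')
with open(story_path, 'w') as out:
    out.write('The fog rolled in.\nit stayed all day.\nWe walked anyway.\n')

total = 0   # called total rather than sum, to avoid shadowing the built-in sum()
count = 0
with open(story_path) as fh:
    for line in fh:
        # Every line read from a file has at least one character (the newline)
        if line[0] in string.ascii_uppercase:
            count += 1
            # split() with no argument splits on any whitespace,
            # so a trailing newline is never counted as part of the word
            total += len(line.split()[0])

# float() avoids integer division on Python 2; this assumes count is not 0
average = float(total) / count   # (3 + 2) / 2 = 2.5

result_path = os.path.join(workdir, 'ave_first_word_length.txt')
with open(result_path, 'w') as out:
    out.write('Sample Story\n')
    out.write(str(average) + '\n')
```

For the real project, replace the sample file with 'theyellowwallpaper.txt' and the placeholder title with the story's title.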