# Files

## Opening Files

To read a file, you must first open it. This returns a file handle, which you can then use to get the contents of the file. If the file doesn't exist, this will throw an error.

    file_handle = open('filename.txt')
    

Once you are done with a file, you need to close it. Leaving files open can cause problems, particularly on locking filesystems.

    file_handle.close()


In [1]:
# Run these to get some of the files we will be using today. 
# These are the salaries of public workers in California from the website transparentcalifornia
# The last line is downloading a short story for the project

import urllib
urllib.urlretrieve("http://transparentcalifornia.com/export/san-francisco-2014.csv", "san-francisco-2014.csv")
urllib.urlretrieve("http://transparentcalifornia.com/export/san-francisco-2013.csv", "san-francisco-2013.csv")
urllib.urlretrieve("http://www.gutenberg.org/cache/epub/1952/pg1952.txt", "theyellowwallpaper.txt")

('theyellowwallpaper.txt', <httplib.HTTPMessage instance at 0x109d7c560>)

In [2]:
# Opening a file
fh = open('san-francisco-2014.csv')
print fh
fh.close()


<open file 'san-francisco-2014.csv', mode 'r' at 0x105e4df60>


In [3]:
# Opening a non-existent file
fh = open('i_dont_exist.txt')
print fh
fh.close()

IOError: [Errno 2] No such file or directory: 'i_dont_exist.txt'

### TRY IT
Open and close the san-francisco-2013.csv file

In [4]:
fh = open('san-francisco-2013.csv')
print fh
fh.close()

<open file 'san-francisco-2013.csv', mode 'r' at 0x109d6f5d0>


## Text files and lines

A text file is just a sequence of lines. In fact, if you read it in all at once with `readlines`, you get back a list of strings, one per line.

Each line is separated by the newline character "\n". This is the special character that is inserted into text files when you hit Enter (you can also put it into strings deliberately by using the special \n syntax).

In [5]:
print "Golden\nGate\nBridge"

Golden
Gate
Bridge
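To make the newline character a little more concrete, here is a minimal sketch of how "\n" behaves inside a string. It uses `print(...)` with a single argument so that it should run unchanged on both Python 2 and Python 3; the variable names are only for illustration.

```python
# "\n" counts as a single character inside a string.
text = "Golden\nGate\nBridge"
print(len(text))  # 6 + 1 + 4 + 1 + 6 characters

# split("\n") breaks the string apart at every newline.
words = text.split("\n")
print(words)

# rstrip("\n") removes a trailing newline. Lines read from a file
# keep their "\n", which is why later examples call rstrip().
line = "Alcatraz\n"
stripped = line.rstrip("\n")
print(repr(stripped))
```
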


### TRY IT
Print your name on two lines using only one print statement 

In [6]:
print "Emily\nNielson"

Emily
Nielson


## Reading from files

There are two common ways to read through a file. The first (and usually better) way is to loop through the lines in the file.

    for line in file_handle:
        print line

The second is to read all the lines at once and store them as a string or a list.

    lines = file_handle.read() # stores as a single string
    lines = file_handle.readlines() # stores as a list of strings (separates on new lines)

Unless you are going to process the lines in a file several times, use the first method. It reads one line at a time instead of holding the whole file in memory, which matters if you ever have big files.

In [12]:
fh = open('thingstodo.txt')
for line in fh:
    print line.rstrip()
fh.close()


Alcatraz
Golden Gate Bridge
Golden Gate Park
The Exploratorium
Pier 39


In [11]:
fh = open('thingstodo.txt')
contents = fh.read()
fh.close()

print contents
print type(contents)

fh = open('thingstodo.txt')
lines = fh.readlines()
fh.close()

print lines
print type(lines)

Alcatraz
Golden Gate Bridge
Golden Gate Park
The Exploratorium
Pier 39

<type 'str'>
['Alcatraz\n', 'Golden Gate Bridge\n', 'Golden Gate Park\n', 'The Exploratorium\n', 'Pier 39\n']
<type 'list'>


### TRY IT
Open 'san-francisco-2013.csv' and print out the first line. You can use either method. If you are using the loop method, you can 'break' after printing the first line.

In [17]:
fh = open('san-francisco-2013.csv')
# Loop method:
# for line in fh:
#     print line
#     break
lines = fh.readlines()
# lines[0] is the first line of the file (the header row)
print lines[0]
fh.close()

employee_name,job_title,base_pay,overtime_pay,other_pay,total_benefits,total_pay,total_pay_benefits,year,notes,jurisdiction_name


## Searching through a file

When searching through a file, you can use string methods to discover and parse the contents.

Let's look at a few examples.

In [18]:
# Looking for a line that starts with something

# I want to see salary data of women with my first name
fh = open('san-francisco-2014.csv')
for line in fh:
    if line.startswith('Emily'):
        print line
fh.close()

Emily Lee,Senior Physician Specialist,178950.03,0.00,0.00,52859.83,178950.03,231809.86,2014,,San Francisco,FT

Emily Goldman,Attorney (Civil/Criminal),172150.70,0.00,1312.50,50927.47,173463.20,224390.67,2014,,San Francisco,FT

Emily Prescott,"Manager,Employee Relations Div",152325.27,0.00,0.00,56584.45,152325.27,208909.72,2014,,San Francisco,FT

Emily Murase,Dept Head I,142364.06,0.00,0.00,53545.01,142364.06,195909.07,2014,,San Francisco,FT

Emily M Morrison,Manager III,132690.02,0.00,0.00,50917.46,132690.02,183607.48,2014,,San Francisco,FT

Emily L Dahm,Attorney (Civil/Criminal),125830.00,0.00,5041.50,41764.64,130871.50,172636.14,2014,,San Francisco,FT

Emily R Luck,Diagnostic Imaging Tech IV,113566.58,6618.71,10067.18,39672.40,130252.47,169924.87,2014,,San Francisco,FT

Emily B Gerber,Manager I,112649.00,0.00,0.00,45432.00,112649.00,158081.00,2014,,San Francisco,FT

Emily S O'Rourke,Firefighter,96940.13,6000.31,6809.44,38479.04,109749.88,148228.92,2014,,San Francisco,FT

Emily R Watt

In [19]:
# Looking for lines that contain a specific string
fh = open('san-francisco-2014.csv')
# Looking for all the department heads
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Dept Head') != -1:
        print line
fh.close()

John L Martin,Dept Head V,311298.55,0.00,0.00,89772.32,311298.55,401070.87,2014,,San Francisco,FT

Barbara A Garcia,Dept Head V,279839.22,0.00,2164.54,82884.27,282003.76,364888.03,2014,,San Francisco,FT

Naomi M Kelly,Dept Head V,267914.01,0.00,0.00,80361.22,267914.01,348275.23,2014,,San Francisco,FT

Trent E Rhorer,Dept Head V,267914.00,0.00,0.00,79799.88,267914.00,347713.88,2014,,San Francisco,FT

Jay P Huish,Dept Head V,267914.00,0.00,0.00,79799.88,267914.00,347713.88,2014,,San Francisco,FT

John S Rahaim,Dept Head IV,232489.33,0.00,0.00,72233.72,232489.33,304723.05,2014,,San Francisco,FT

Luis Herrera,Dept Head IV,226832.02,0.00,0.00,71586.74,226832.02,298418.76,2014,,San Francisco,FT

Mohammed C Nuru,Dept Head IV,219184.68,0.00,0.00,70095.24,219184.68,289279.92,2014,,San Francisco,FT

Anne M Kronenberg,Dept Head IV,219212.28,0.00,0.00,69397.69,219212.28,288609.97,2014,,San Francisco,FT

Tom C Hui,Dept Head III,202290.14,0.00,14958.35,68955.03,217248.49,286203.52,2014,,San Francisc

In [20]:
# Counting lines that match criteria
fh = open('san-francisco-2014.csv')
num_trainees = 0
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Trainee') != -1:
        num_trainees += 1
fh.close()
print "There are {0} trainees".format(num_trainees)

There are 518 trainees
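The `find(...) != -1` test can also be written with the `in` operator from the strings lesson, which many people find easier to read. The sketch below assumes a few made-up lines held in a list rather than the real salary file, so the names and numbers in `sample_lines` are purely hypothetical. It uses `print(...)` with one argument so it should run on both Python 2 and 3.

```python
# Hypothetical sample lines standing in for rows of the salary file.
sample_lines = [
    "Pat Doe,Trainee Firefighter,100.00\n",
    "Sam Roe,Manager I,200.00\n",
    "Lee Poe,Nurse Trainee,150.00\n",
]

num_trainees = 0
for line in sample_lines:
    # 'Trainee' in line is True exactly when line.find('Trainee') != -1
    if 'Trainee' in line:
        num_trainees += 1

print("There are {0} trainees".format(num_trainees))
```
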


In [22]:
# Splitting lines, this is great for excel like data (tsv, csv)
# I want to see salary data of women with my name
fh = open('san-francisco-2014.csv')
for line in fh:
    if line.startswith('Emily'):
        cols = line.split(',')
        # Job title is the 2nd column (index 1), base pay is the 3rd (index 2)
        print cols[1], cols[2]
fh.close()

Senior Physician Specialist 178950.03
Attorney (Civil/Criminal) 172150.70
"Manager Employee Relations Div"
Dept Head I 142364.06
Manager III 132690.02
Attorney (Civil/Criminal) 125830.00
Diagnostic Imaging Tech IV 113566.58
Manager I 112649.00
Firefighter 96940.13
Senior Physician Specialist 101753.60
EMT/Paramedic/Firefighter 80594.00
Senior Administrative Analyst 98753.05
Senior Administrative Analyst 94229.51
Medical Social Worker 90185.02
Senior Administrative Analyst 89586.46
"Manager II  MTA"
Assoc Engineer 86868.14
Senior Administrative Analyst 82825.42
Program Specialist 82589.08
Estate Investigator 82052.03
Employment & Training Spec 3 81005.01
Secretary 2 65854.06
Payroll Clerk 66995.00
Executive Secretary 1 67037.39
Health Worker 3 62780.22
Special Nurse 77970.70
Fingerprint Technician 1 60141.11
HSA Sr Eligibility Worker 62095.45
IS Business Analyst-Senior 71136.77
Health Worker 3 58880.61
Accountant II 60320.29
Nutritionist 61297.53
Mayoral Staff IV 46688.03
Senior Personn
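Notice that job titles containing a comma inside quotes (for example "Manager,Employee Relations Div") get cut in two by `split(',')`, so the salary never shows up for those rows. One possible way around this is Python's standard `csv` module, which understands quoted fields. The sketch below is only an illustration: it feeds two lines copied from the output above to `csv.reader` from a list instead of opening the real file, and it should run on both Python 2 and 3.

```python
import csv

# Two rows copied from the output above, held in a list instead of a file.
rows = [
    'Emily Prescott,"Manager,Employee Relations Div",152325.27,0.00,0.00,'
    '56584.45,152325.27,208909.72,2014,,San Francisco,FT\n',
    "Emily S O'Rourke,Firefighter,96940.13,6000.31,6809.44,38479.04,"
    "109749.88,148228.92,2014,,San Francisco,FT\n",
]

# A plain split treats the comma inside the quotes as a column break.
naive = rows[0].split(',')
print("{0} {1}".format(naive[1], naive[2]))

# csv.reader accepts any iterable of lines (a list here, a file handle
# in practice) and keeps the quoted job title together as one column.
parsed = list(csv.reader(rows))
for cols in parsed:
    print("{0}: {1}".format(cols[1], cols[2]))
```

With a real file you would pass the file handle to `csv.reader` instead of the list.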

In [23]:
# Skipping lines
fh = open('thingstodo.txt')
for line in fh:
    if line.startswith('Golden'):
        continue
    print line
fh.close()

Alcatraz

The Exploratorium

Pier 39



## Try, except with open

If you are worried that the file might not exist, you can wrap the open in a try block.

    try:
        fh = open('i_dont_exist.txt')
    except IOError:
        print "File does not exist"
        exit()
        
  
    

In [24]:
# Opening a non-existent file
try:
    fh = open('i_dont_exist.txt')
    print fh
    fh.close()
except IOError:
    print "File does not exist"
    #exit()


File does not exist
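A bare `except:` catches every kind of error, including typos in your own code, so it can hide real bugs. A more targeted option is to catch only `IOError`, which is what `open` raises for a missing file. The sketch below wraps that idea in a small helper; `first_line_or_none` is a hypothetical name invented for this example, and the temporary directory is only there to guarantee the file is missing. It should run on both Python 2 and 3.

```python
import os
import tempfile


def first_line_or_none(path):
    """Hypothetical helper: return the first line of path, or None if it can't be opened."""
    try:
        fh = open(path)
    except IOError:
        # Only file-related errors land here; other mistakes still surface.
        return None
    line = fh.readline()
    fh.close()
    return line


# A brand-new temporary directory is empty, so this file cannot exist.
missing = os.path.join(tempfile.mkdtemp(), 'i_dont_exist.txt')
result = first_line_or_none(missing)
print(result)
```
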


## Writing to files

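Writing to files is just as easy. You need to give `open` a second parameter, 'w', to indicate you want to open the file in write mode. Be aware that write mode creates the file if it doesn't exist and erases its contents if it does.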
  
     fh_write = open('new_file.txt', 'w')

Then you call the write method on the file handle. You give it the string you want to write to the file. Be careful, `write` doesn't add a new line character to the end of strings like `print` does.

     fh_write.write('line to write\n')
     
Just like reading files, you need to close your file when you are done.

     fh_write.close()

In [32]:
fh = open('numbers.txt', 'w')
for i in range(10):
    fh.write(str(i) + '\n')
fh.close()
fh = open('numbers.txt')
lines = fh.readlines()
print lines
fh.close()


['0\n', '1\n', '2\n', '3\n', '4\n', '5\n', '6\n', '7\n', '8\n', '9\n']
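Two details of write mode are easy to trip over: `write` adds no newline of its own, and opening an existing file with 'w' empties it first. The sketch below shows both, assuming a scratch file in a temporary directory (the path and file name are made up for this example). It should run on both Python 2 and 3.

```python
import os
import tempfile

# Hypothetical scratch file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

fh = open(path, 'w')
fh.write('first')      # no newline is added here...
fh.write('second\n')   # ...so this continues on the same line
fh.close()

fh = open(path)
first_contents = fh.read()
fh.close()
print(repr(first_contents))

# Reopening in 'w' mode empties the file before anything is written.
fh = open(path, 'w')
fh.write('replaced\n')
fh.close()

fh = open(path)
second_contents = fh.read()
fh.close()
print(repr(second_contents))
```
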


### TRY IT
Create a file called 'my_favorite_cities.txt' and put your top 3 favorite cities each on its own line.

**Bonus** Check that you did it correctly by reading the lines back in Python.

In [34]:
fh = open('my_favorite_cities.txt', 'w')
fh.write("Boston\nDC\nMiami\n")
fh.close()

# Bonus

fh = open('my_favorite_cities.txt')
for line in fh:
    print line.rstrip()
fh.close()

Boston
DC
Miami


## With statement and opening files

You can use `with` to open a file, and it will automatically close the file at the end of the `with` block. This is the preferred way to open files in Python. (Sorry it took me so long to show you.)

    with open('filename.txt') as file_handle:
        for line in file_handle:
            print line
            
    # You don't have to close the file

In [35]:
with open('thingstodo.txt') as fh:
    for line in fh:
        print line.rstrip()

Alcatraz
Golden Gate Bridge
Golden Gate Park
The Exploratorium
Pier 39


You can also use with statements to write files.

In [37]:
with open('numbers2.txt', 'w') as fh:
    for i in range(5):
        fh.write(str(i) + '\n')
with open('numbers2.txt') as fh:
    for line in fh:
        print line.rstrip()

0
1
2
3
4
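If you want to convince yourself that the file really is closed, file handles have a `closed` attribute you can check. The sketch below assumes a scratch file in a temporary directory (a made-up path) and also shows that the file is closed even when an error interrupts the `with` block. It should run on both Python 2 and 3.

```python
import os
import tempfile

# Hypothetical scratch file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'closing_demo.txt')

with open(path, 'w') as fh:
    fh.write('Alcatraz\n')
    closed_inside = fh.closed   # still open inside the block
closed_after = fh.closed        # closed as soon as the block ends
print(closed_inside)
print(closed_after)

# The file is closed even if an error happens inside the block.
try:
    with open(path) as fh:
        raise ValueError("something went wrong mid-read")
except ValueError:
    pass
closed_after_error = fh.closed
print(closed_after_error)
```
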


### TRY IT

Refactor this code to use a with statement:

    # Counting lines that match criteria
    fh = open('san-francisco-2014.csv')
    num_trainees = 0
    for line in fh:
        # Remember if find doesn't find the string, it returns -1
        if line.find('Trainee') != -1:
            num_trainees += 1
    fh.close()
    print "There are {0} trainees".format(num_trainees)


In [39]:
with open('san-francisco-2014.csv') as fh:
    num_trainees = 0
    for line in fh:
        # Remember if find doesn't find the string, it returns -1
        if line.find('Trainee') != -1:
            num_trainees += 1
    print "There are {0} trainees".format(num_trainees)

There are 518 trainees


# Project 

We will calculate the average length of the first word in sentences in the short story "The Yellow Wallpaper" by Charlotte Perkins Gilman. (Feel free to use a different story; Project Gutenberg has many free ones: https://www.gutenberg.org/)

1. Open the file in read mode using a with statement
2. Initialize two variables sum and count to the value of 0
3. Loop through each line. If the first character of the line is a capital letter (Check the strings lesson for the `in` keyword):
    * Add 1 to count
    * Split the line on spaces and find the length of the first word. Add this length to sum.
4. Calculate the average length of first words of sentences using the sum and count variables (be careful about integer division).
5. Open a new file 'ave_first_word_length.txt' in write mode using a with statement
6. Write the title of the story on the first line and the average first word length on the second line.
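
If you get stuck, the steps above can be sketched roughly as follows. This is only one possible shape, not the definitive solution: it runs on a tiny made-up sample file written to a temporary directory instead of the real story, the title "Sample Story" is a placeholder, and it uses the name `total` rather than `sum` to avoid hiding Python's built-in `sum` function. It should run on both Python 2 and 3.

```python
import os
import string
import tempfile

workdir = tempfile.mkdtemp()
# Hypothetical stand-in for theyellowwallpaper.txt
story_path = os.path.join(workdir, 'sample_story.txt')
result_path = os.path.join(workdir, 'ave_first_word_length.txt')

# Create a small sample so the sketch is self-contained.
with open(story_path, 'w') as fh:
    fh.write("It is a short sample.\n")
    fh.write("and this line does not count.\n")
    fh.write("\n")
    fh.write("Golden light came in.\n")

total = 0
count = 0
with open(story_path) as fh:
    for line in fh:
        # Every line has at least its "\n", so line[0] is safe here.
        if line[0] in string.ascii_uppercase:
            count += 1
            first_word = line.split(' ')[0]
            total += len(first_word)

# float() avoids integer division in Python 2; assumes count is not 0.
average = float(total) / count

with open(result_path, 'w') as fh:
    fh.write("Sample Story\n")
    fh.write("{0}\n".format(average))

print(average)
```
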