# File Reading

In this unit, we discuss file reading, including different ways to read data from a file.

---
## Learning objectives

By the end of this unit, you should be able to…

- Write code to open a file
- Apply split to read information from a file
- Write code to read a CSV file
- Write code to read only one line of a file


---
## File reading with loops

### Try it

What’s printed by the following code? **This is a poll question.**

|||
|---|---|
|A.|0|
|B.|1|
|C.|2|
|D.|3|
|E.|5|

foo.txt contains:<br>
May I help you?<br>No, thanks!<br>Do you like the snow?<br>I don’t know.<br>Bye!

In [None]:
my_file = open("foo.txt", 'r')
total = 0
for line in my_file:
    if 'no' in line:
        total = total + 1

print(total)

What if we change the code as shown below?

In [None]:
my_file = open("foo.txt", 'r')
total = 0
for line in my_file:
    if 'no' in line.lower():
        total = total + 1

print(total)

---
### Learn it

We can access a file on our computer using the open function.

In [None]:
# opens the file with the file name
# and saying we want to read it
my_file = open("foo.txt", 'r') 

# loops through each line of the file
for line in my_file:
    print(line[:3])
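
The cell above never closes the file. As a minimal sketch, assuming a small sample file created only for this demonstration (the name `foo_demo.txt` and the temporary folder are hypothetical), the `with` form of `open` closes the file automatically once the indented block ends:

```python
import os
import tempfile

# Create a small sample file in a temporary folder (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "foo_demo.txt")
with open(path, "w") as out_file:
    out_file.write("May I help you?\nNo, thanks!\nBye!\n")

# 'with' closes the file automatically when the indented block ends.
prefixes = []
with open(path, "r") as my_file:
    for line in my_file:
        prefixes.append(line[:3])  # first three characters of each line

print(prefixes)
print(my_file.closed)  # True once the 'with' block has finished
```

Calling `my_file.close()` by hand after the loop is an alternative; `with` simply makes it harder to forget.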

---

Using split, we can read data from comma-separated values (CSV) files. Each line is one record, and split(",") breaks it into a list of fields.

chats.csv contains:<br>
09/09/2020,Olsen,11:15am,Hello!<br>
09/09/2020,Chen,11:22am,Hi!<br>
09/09/2020,Olsen,11:34am,Would you like lunch?

In [None]:
my_file = open("chats.csv", "r")

for line in my_file:
    fields = line.split(",")
    print(fields[1])
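
One detail worth seeing: every line read from a file still ends with a newline character, so the last field keeps it unless we strip it first. The sketch below assumes a two-line sample file written just for this demonstration (the name `chats_demo.csv` is hypothetical) and compares the last field with and without `strip`:

```python
import os
import tempfile

# Write a two-line sample CSV file to a temporary folder (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "chats_demo.csv")
with open(path, "w") as out_file:
    out_file.write("09/09/2020,Olsen,11:15am,Hello!\n")
    out_file.write("09/09/2020,Chen,11:22am,Hi!\n")

raw_last = []    # last field taken straight from split
clean_last = []  # last field after stripping the line first
with open(path, "r") as my_file:
    for line in my_file:
        raw_last.append(line.split(",")[-1])
        clean_last.append(line.strip().split(",")[-1])

print(raw_last)    # each entry still ends with "\n"
print(clean_last)  # newline removed
```

Note that a plain `split(",")` will also break a field that itself contains a comma, so this approach fits simple files like the one shown.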

---

What’s printed by the following code? **This is a poll question.**

|||
|---|---|
|A.|0|
|B.|2|
|C.|6|
|D.|9|
|E.|10|
|F.|13|
|G.|19|
|H.|Error!|

data.csv contains:<br>
7,9<br>
10,12<br>
3,16<br>
9,19

In [None]:
def file_mystery(filename):
    f = open(filename, 'r')
    biggest = 0
    for line in f:
        vals = line.split(',')
        diff = int(vals[1]) - int(vals[0])
        if diff > biggest:
            biggest = diff
      
    return biggest

print(file_mystery("data.csv"))

What is printed with this new file?

data2.csv contains:<br>
-5,1<br>
16,3<br>
4,5<br>
1,10<br>

In [None]:
print(file_mystery("data2.csv"))

**In groups:** Modify the function so it returns the biggest absolute difference.

In [None]:
def file_mystery_v2(filename):
    f = open(filename, 'r')
    biggest = 0
    for line in f:
        vals = line.split(',')
        diff = int(vals[1]) - int(vals[0])
        if diff > biggest:
            biggest = diff
      
    return biggest

---
The USGS collects data on earthquakes around the world.


In [None]:
f = open("significant_month.txt","r")

for line in f:
    print(line)

Which expression (XXX) will give us the name of the earthquake’s location? **This is a poll question.**

|||
|---|---|
|A.|fields[6]|
|B.|fields[6:]|
|C.|fields[-2:]|
|D.|fields[-1]|
|E.|It cannot be done with one expression.|


In [None]:
eq_file = open('significant_month.txt', "r")
for line in eq_file:
    fields = line.split()
    location = XXX

With fields[6:], location is a list of strings, so printing it shows a list rather than a single string.

In [None]:
eq_file = open('significant_month.txt', "r")
for line in eq_file:
    fields = line.split()
    location = fields[6:]
    print(location)

We’ll use join to make sure the location name is a single string.

In [None]:
eq_file = open('significant_month.txt', "r")
for line in eq_file:
    fields = line.split()
    location = " ".join(fields[6:])
    print(location)
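
To see split and join on their own, without a file, here is a small sketch on one sample line. It assumes the layout used in this unit's earthquake files: six whitespace-separated fields followed by the location name.

```python
# One sample line in the assumed layout: six fields, then the location name.
line = "4.0 2006/10/19 00:31:15 20.119 -156.213 1.5 MAUI REGION, HAWAII\n"

fields = line.split()            # split on any whitespace, newline included
location_parts = fields[6:]      # a list holding each word of the location
location = " ".join(location_parts)  # glue the words back together with spaces

print(location_parts)
print(location)
```

The string before `.join` is the separator placed between the pieces, so `"-".join(...)` would give hyphens instead of spaces.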

---
### Apply it

**In groups:** What is printed when this code runs?

eq2.txt contains:<br>
2.0 2006/10/19 02:02:10 62.391 -149.751 15.0 CENTRAL ALASKA<br>
4.0 2006/10/19 00:31:15 20.119 -156.213 1.5 MAUI REGION, HAWAII<br>
5.0 2006/10/18 21:15:51 4.823 -82.592 37.3 SOUTH OF PANAMA<br>
3.0 2006/10/18 21:12:25 59.934 -147.904 30.0 GULF OF ALASKA<br>


In [None]:
eq_file = open("eq2.txt", 'r')
sum = 0.0
num_lines = 0
for line in eq_file:
    fields = line.split()
    sum = sum + float(fields[0])
    num_lines = num_lines + 1

print(sum / num_lines)

What if eq2.txt were empty? What error would this code raise? How could we change the code to avoid that error?

In [None]:
eq_file = open("eq2.txt", 'r')
sum = 0.0
num_lines = 0
for line in eq_file:
    fields = line.split()
    sum = sum + float(fields[0])
    num_lines = num_lines + 1

print(sum / num_lines)

---
## Readline

### Try it

**Group discussion:** What is printed if eq2.txt has a header row?

eq2.txt contains:<br>
mag date time lat long dep loc<br>
2.0 2006/10/19 02:02:10 62.391 -149.751 15.0 CENTRAL ALASKA<br>
4.0 2006/10/19 00:31:15 20.119 -156.213 1.5 MAUI REGION, HAWAII<br>
5.0 2006/10/18 21:15:51 4.823 -82.592 37.3 SOUTH OF PANAMA<br>
3.0 2006/10/18 21:12:25 59.934 -147.904 30.0 GULF OF ALASKA<br>


In [None]:
eq_file = open("eq2.txt", 'r')
sum = 0.0
num_lines = 0
for line in eq_file:
    fields = line.split()
    sum = sum + float(fields[0])
    num_lines = num_lines + 1

print(sum / num_lines)

---

### Learn it

We can use the file object’s readline method to read a single line at a time.

In [None]:
eq_file = open('significant_month.txt', "r")
eq_file.readline()

In [None]:
line2 = eq_file.readline()
print(line2)

In [None]:
eq_file.readline().split()
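
Each call to readline picks up where the previous one stopped. The sketch below assumes a three-line sample file written just for this demonstration (the name `eq_demo.txt` is hypothetical) and shows three calls in a row returning three different lines:

```python
import os
import tempfile

# Write a three-line sample file to a temporary folder (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "eq_demo.txt")
with open(path, "w") as out_file:
    out_file.write("mag date time lat long dep loc\n")
    out_file.write("2.0 2006/10/19 02:02:10 62.391 -149.751 15.0 CENTRAL ALASKA\n")
    out_file.write("5.0 2006/10/18 21:15:51 4.823 -82.592 37.3 SOUTH OF PANAMA\n")

with open(path, "r") as eq_file:
    first = eq_file.readline()                 # the header line
    second = eq_file.readline()                # the next line, not the header again
    third_fields = eq_file.readline().split()  # the third line, split into fields

print(first)
print(second)
print(third_fields)
```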

---
This code uses a priming read to initialize some of the data. A priming read is often used to deal with a "header row" in a data file.

In [None]:
eq_file = open('significant_month.txt', "r")
# priming read: skip the header row
eq_file.readline()
# loop through the remaining lines
for line in eq_file:
    fields = line.split()
    location = " ".join(fields[6:])
    print(location)
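
The same pattern works when the loop does arithmetic on the data lines. As a sketch, assuming a small sample file with a header row (the name `eq_header_demo.txt` is hypothetical), the priming read consumes the header so that `float` is only ever applied to real magnitudes:

```python
import os
import tempfile

# Build a small sample file with a header row (hypothetical name and location).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "eq_header_demo.txt")
with open(path, "w") as out_file:
    out_file.write("mag date time lat long dep loc\n")
    out_file.write("2.0 2006/10/19 02:02:10 62.391 -149.751 15.0 CENTRAL ALASKA\n")
    out_file.write("4.0 2006/10/19 00:31:15 20.119 -156.213 1.5 MAUI REGION, HAWAII\n")

total = 0.0
num_lines = 0
with open(path, "r") as eq_file:
    eq_file.readline()          # priming read: consume the header row
    for line in eq_file:        # the loop starts at the first data line
        fields = line.split()
        total = total + float(fields[0])
        num_lines = num_lines + 1

average = total / num_lines
print(average)
```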

---

readline returns an empty string when it reaches the end of a file.

In [None]:
f = open("eq2.txt", "r")
if f.readline() == "":
    print("The file is empty")
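
A blank line in the middle of a file is not the same as the end of the file: a blank line comes back as "\n", while the end of the file comes back as "". The sketch below assumes a short sample file containing one blank line (the name `blank_demo.txt` is hypothetical) and calls readline one more time than there are lines:

```python
import os
import tempfile

# Sample file: a line, a blank line, and a final line (hypothetical name).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "blank_demo.txt")
with open(path, "w") as out_file:
    out_file.write("first line\n\nlast line\n")

results = []
with open(path, "r") as f:
    for _ in range(4):               # the file has 3 lines; the 4th call hits the end
        results.append(f.readline())

print(results)
```

Only the final entry is the empty string, which is why comparing against "" is a safe end-of-file check.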

---

### Apply it

**In groups:** What is printed when the following code runs on each of these files?

temps.csv contains:<br>
7:00,62<br>
8:00,64<br>
9:00,68<br>
10:00,71<br>

temps2.csv contains:<br>
19:00,59<br>
20:00,61<br>
21:00,60<br>
22:00,55<br>


In [None]:
def temps_mystery(filename):
    f = open(filename, 'r')
    vals = f.readline().split(",")
    previous = int(vals[1])
    biggest = 0
    for line in f:
        vals = line.split(",")
        current = int(vals[1])
        diff = current - previous
        if diff > biggest:
            biggest = diff
        previous = current
      
    return biggest

In [None]:
print(temps_mystery("temps.csv"))

In [None]:
print(temps_mystery("temps2.csv"))