# Notes for 03 October

Reading files with loops

First we'll fetch a file from the network.

In [1]:
import urllib.request
url = 'http://goo.gl/UG0JBi'
urllib.request.urlretrieve(url, 'sometext.txt')

('sometext.txt', <http.client.HTTPMessage at 0x7f66580ac198>)

We can open the file with the `open` function. The object we get back is a file object, which I'll call `fp`.

In [2]:
fp = open('sometext.txt', 'r')

In [3]:
fp

<_io.TextIOWrapper name='sometext.txt' mode='r' encoding='UTF-8'>

I can read a line from the file with its `readline` method.

In [4]:
fp.readline()

'This is a file\n'

Calling it again produces the next line. This implies that the object is keeping track of how far we have advanced into the file.

In [5]:
fp.readline()

'with several lines\n'

In [6]:
fp.readline()

'of text.\n'

In [7]:
fp.readline()

'In Python we can access\n'

In [8]:
fp.readline()

'it as a sequence of lines\n'

In [9]:
fp.readline()

'using the open function.\n'

When we get to the end of the file it will return an empty string.

In [10]:
fp.readline()

''

In [11]:
fp.readline()

''
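The empty-string result can serve as a stop signal. Here is a minimal sketch of that idea. It assumes a small stand-in file written to a temporary directory (the name `demo.txt` and the variable `lines_read` are made up for illustration) so that it runs without the network.

```python
import os
import tempfile

# Hypothetical stand-in for sometext.txt, so the sketch needs no download.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as out:
    out.write('This is a file\nwith several lines\nof text.\n')

fp = open(path, 'r')
lines_read = []
while True:
    line = fp.readline()
    if line == '':          # an empty string means end of file
        break
    lines_read.append(line)
fp.close()

print(lines_read)
# ['This is a file\n', 'with several lines\n', 'of text.\n']
```

A blank line in the middle of a file comes back as `'\n'`, not `''`, so this loop would not stop early on it.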

I could try to read all the lines using several statements. But this is a waste of typing.

In [12]:
fp = open('sometext.txt', 'r')
line0 = fp.readline()
line1 = fp.readline()
line2 = fp.readline()
line3 = fp.readline()
line4 = fp.readline()
line5 = fp.readline()

In [13]:
print(line0,line1,line2,line3,line4,line5)

This is a file
 with several lines
 of text.
 In Python we can access
 it as a sequence of lines
 using the open function.



Instead, I'll read them in a loop. The `for` statement will iterate over any kind of sequence, and a file object behaves like a sequence of its lines. The statements indented under it will execute once for each line in the file.

In [14]:
lines = []
for line in open('sometext.txt', 'r'):
    lines.append(line)
print(lines)

['This is a file\n', 'with several lines\n', 'of text.\n', 'In Python we can access\n', 'it as a sequence of lines\n', 'using the open function.\n']
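The loops in these notes never close the file explicitly. One common way to handle that, shown here only as a sketch, is the `with` statement, which closes the file when the indented block ends. The stand-in file `demo.txt` is an assumption so the example runs on its own.

```python
import os
import tempfile

# Hypothetical stand-in for sometext.txt, so the sketch needs no download.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as out:
    out.write('This is a file\nwith several lines\nof text.\n')

lines = []
with open(path, 'r') as fp:     # fp is closed automatically after the block
    for line in fp:
        lines.append(line)

print(lines)
# ['This is a file\n', 'with several lines\n', 'of text.\n']
print(fp.closed)
# True
```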


We talked briefly about lists to remind ourselves how they work.

In [15]:
L = []
L.append('foo')
print(L)

['foo']


In [16]:
L.append('bar')
print(L)

['foo', 'bar']


I'll read the lines again, this time `strip`ping the newlines off of them. Now the strings in the list don't have that '\n' character at the end.

In [17]:
lines = []
for line in open('sometext.txt', 'r'):
    lines.append(line.strip())
print(lines)

['This is a file', 'with several lines', 'of text.', 'In Python we can access', 'it as a sequence of lines', 'using the open function.']
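Note that `strip` removes whitespace from both ends of the string, not just the trailing newline. A short sketch of the difference, using a made-up string `raw` with leading spaces:

```python
raw = '  indented line  \n'     # hypothetical example line

both_ends = raw.strip()         # whitespace removed from both ends
newline_only = raw.rstrip('\n') # only the trailing newline removed
right_only = raw.rstrip()       # all trailing whitespace removed

print(repr(both_ends))
# 'indented line'
print(repr(newline_only))
# '  indented line  '
print(repr(right_only))
# '  indented line'
```

For the file in these notes the three give the same result, because no line starts or ends with spaces.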


We talked about `split` and how it can divide a string into a list of strings. The default is to divide at runs of whitespace (spaces, tabs, and newlines).

In [18]:
'this is a string'.split()

['this', 'is', 'a', 'string']
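To see what the default does, it may help to compare it with an explicit separator. The strings below are made up for illustration.

```python
text = 'this  is\ta string\n'   # two spaces, a tab, and a trailing newline

default_parts = text.split()    # splits on any run of whitespace
space_parts = text.split(' ')   # splits on every single space only
comma_parts = 'a,b,c'.split(',')

print(default_parts)
# ['this', 'is', 'a', 'string']
print(space_parts)
# ['this', '', 'is\ta', 'string\n']
print(comma_parts)
# ['a', 'b', 'c']
```

With an explicit `' '` separator the doubled space produces an empty string, and the tab and newline stay inside the pieces.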

We can bust each line into words and now we've got two nested loops! The outer loop reads the lines, strips, and splits them. The inner loop takes the words and adds them to the list of words.

In [19]:
words = []
for line in open('sometext.txt', 'r'):
    linewords = line.strip().split()
    for word in linewords:
        words.append(word)
print(words)

['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of', 'text.', 'In', 'Python', 'we', 'can', 'access', 'it', 'as', 'a', 'sequence', 'of', 'lines', 'using', 'the', 'open', 'function.']
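The inner loop can also be written more compactly. The sketch below shows two alternatives, `list.extend` and a list comprehension, applied to a small made-up list `sample_lines` rather than the real file so that it runs on its own.

```python
# Hypothetical stand-in for the lines of the file.
sample_lines = ['This is a file\n', 'with several lines\n', 'of text.\n']

# extend adds every item of a list at once, replacing the inner loop.
words_extend = []
for line in sample_lines:
    words_extend.extend(line.split())

# A list comprehension with two for clauses expresses both loops in one line.
words_comp = [word for line in sample_lines for word in line.split()]

print(words_extend)
# ['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of', 'text.']
print(words_extend == words_comp)
# True
```

The call to `strip` is left out here because `split` with no arguments already ignores leading and trailing whitespace.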


We can decorate our code with some print calls so we can see what is happening.

In [20]:
words = []
for line in open('sometext.txt', 'r'):
    print('line=', line)
    linewords = line.strip().split()
    print('linewords=', linewords)
    for word in linewords:
        print('word=', word)
        words.append(word)
print(words)

line= This is a file

linewords= ['This', 'is', 'a', 'file']
word= This
word= is
word= a
word= file
line= with several lines

linewords= ['with', 'several', 'lines']
word= with
word= several
word= lines
line= of text.

linewords= ['of', 'text.']
word= of
word= text.
line= In Python we can access

linewords= ['In', 'Python', 'we', 'can', 'access']
word= In
word= Python
word= we
word= can
word= access
line= it as a sequence of lines

linewords= ['it', 'as', 'a', 'sequence', 'of', 'lines']
word= it
word= as
word= a
word= sequence
word= of
word= lines
line= using the open function.

linewords= ['using', 'the', 'open', 'function.']
word= using
word= the
word= open
word= function.
['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of', 'text.', 'In', 'Python', 'we', 'can', 'access', 'it', 'as', 'a', 'sequence', 'of', 'lines', 'using', 'the', 'open', 'function.']


In [21]:
words = []
for line in open('sometext.txt', 'r'):
    print('line=', line)
    linewords = line.strip().split()
    print('linewords=', linewords)
    for word in linewords:
        print('word=', word)
        words.append(word)
        print('words=', words)
print(words)

line= This is a file

linewords= ['This', 'is', 'a', 'file']
word= This
words= ['This']
word= is
words= ['This', 'is']
word= a
words= ['This', 'is', 'a']
word= file
words= ['This', 'is', 'a', 'file']
line= with several lines

linewords= ['with', 'several', 'lines']
word= with
words= ['This', 'is', 'a', 'file', 'with']
word= several
words= ['This', 'is', 'a', 'file', 'with', 'several']
word= lines
words= ['This', 'is', 'a', 'file', 'with', 'several', 'lines']
line= of text.

linewords= ['of', 'text.']
word= of
words= ['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of']
word= text.
words= ['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of', 'text.']
line= In Python we can access

linewords= ['In', 'Python', 'we', 'can', 'access']
word= In
words= ['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of', 'text.', 'In']
word= Python
words= ['This', 'is', 'a', 'file', 'with', 'several', 'lines', 'of', 'text.', 'In', 'Python']
word= we
words= ['This', 'is', 'a', 'file', 

Suppose we want a list of only the capitalized words. We can use the `if` statement to select only those. The `isupper` method returns True if every letter in the string is uppercase (and there is at least one letter), so we call it on just the first character, `word[0]`.

In [22]:
words = []
for line in open('sometext.txt', 'r'):
    linewords = line.strip().split()
    for word in linewords:
        if word[0].isupper():
            words.append(word)
print(words)

['This', 'In', 'Python']
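Testing `word[0]` and testing the whole word give different answers. A small sketch with a made-up word list (`samples` is not from the file):

```python
samples = ['This', 'NASA', 'file', 'Python']    # hypothetical words

capitalized = []    # first character is uppercase
all_caps = []       # every letter is uppercase
for word in samples:
    if word[0].isupper():
        capitalized.append(word)
    if word.isupper():
        all_caps.append(word)

print(capitalized)
# ['This', 'NASA', 'Python']
print(all_caps)
# ['NASA']
```

Indexing with `word[0]` would fail on an empty string, but `split` with no arguments never produces one.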


Then we modified our code to get only words of length 4 or greater.

In [23]:
words = []
for line in open('sometext.txt', 'r'):
    linewords = line.strip().split()
    for word in linewords:
        if len(word) >= 4:
            words.append(word)
print(words)

['This', 'file', 'with', 'several', 'lines', 'text.', 'Python', 'access', 'sequence', 'lines', 'using', 'open', 'function.']
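Notice that punctuation counts toward the length, so 'text.' has five characters. If that is not wanted, one possible approach (a sketch, not something we did in class) is to strip punctuation from the ends of each word before measuring it. The list `sample_words` is made up for illustration.

```python
import string

sample_words = ['of', 'text.', 'end.', 'open', 'function.']    # hypothetical

long_words = []
for word in sample_words:
    bare = word.strip(string.punctuation)   # drop punctuation at both ends
    if len(bare) >= 4:
        long_words.append(bare)

print(long_words)
# ['text', 'open', 'function']
```

Here 'end.' passes a plain `len(word) >= 4` test but is rejected once the period is removed.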
